1. Introduction.

1.1.  Overview. 

This study continues in the same vein as the work reported by Mulder and Tick
[1], which compared the relative performance of the general purpose MC68020 available
from Motorola and the special purpose PLM developed at U.C. Berkeley [2,3], with
respect to the execution of Prolog programs. Both engines are based on the Warren
Abstract Machine (WAM); the PLM directly executes WAM constructs, and the MC68020
executes a 68020 image that was generated by macroexpanding each WAM construct.

The study reported here compares a more recent Prolog processor, the Berkeley
VLSI-PLM, with the Motorola MC68020 across fourteen Prolog benchmarks. Comparisons
were made for 10 MHz and 16.7 MHz versions of the VLSI-PLM on the one hand,
versus 10 MHz, 12 MHz, 16.7 MHz, 25 MHz, 30 MHz, and 40 MHz versions of the
MC68020. The comparisons reflect the timing information in the MC68020 User's
Manual [4], which identifies three levels of performance, depending on whether
instructions can overlap their execution, whether instruction accesses hit in the small
on-chip instruction cache, or neither.

1.2. Objectives.

We have performed this study for several reasons. First, our long-term objectives
are to understand the relative performance opportunities as well as the relative
cost/performance benefits associated with various implementations of Prolog. Toward
that end, a natural experiment is to compare a tailored special purpose Prolog processor
to an off-the-shelf general purpose part.

Second, the recent study referred to above [1], which compared the Berkeley PLM
and the general purpose MC68020, raised certain concerns which have compelled us to
continue this comparison. For example, Mulder and Tick relied on a significance method
for comparing performance. We felt that the additional work encountered in translating
all 45 PLM constructs to MC68020 machine instructions, rather than the fourteen they
translated, would produce more meaningful results. Also, their comparison involved only
three benchmarks. We felt that the comparison would be improved if the benchmark set
were expanded. While we still need to work on developing a proper set of benchmarks,
we did expand the set used from three to fourteen. Most important, Mulder and Tick
relied on their Lcode architecture model and their extensions to the Berkeley PLM
compiler for obtaining measurements for the Berkeley PLM. In the case of the architecture
model, there are substantial differences between the Lcode architecture and the PLM. In
the case of the compiler, the techniques used in some instances were quite different. Since
we have available the Berkeley PLM compiler [5] and the various Berkeley PLM
simulators [6,7], we were able to more accurately simulate the benchmark set on our
special purpose processors.

Finally, our research in Prolog microarchitecture has reached the point where we
have designed our first single chip implementation, the VLSI-PLM, and it seems
appropriate to compare its performance with that of a single chip general purpose
processor.

1.3. Organization of the report.

This report is organized into five sections and two appendices. Section 2 describes
the mapping between the requirements of the Warren Abstract Machine and its
implementation with the MC68020. Our mapping is slightly different from the one given in
[1]. Section 3 describes the methodology used to compare the two implementations, and
reports the results of the comparison. Various problems endemic to this comparison
method are discussed. In section 4, we analyze these results and offer a number of
observations. In section 5, we summarize the results of this study. Appendix A contains the
MC68020 machine language emulation for each of the 45 PLM instructions, along with
the corresponding timing information. Appendix B contains the five tables which report
the detailed comparison data described in Section 3.

2.  Mapping  the  Warren  Abstract  Machine  onto  the  MC68020. 

Before we could execute WAM code on the MC68020, we needed to identify each of
the abstract WAM instructions in the context of real MC68020 instructions. We
performed this mapping with the goal of making the MC68020 an effective Prolog processor,
removing unnecessary bottlenecks where they presented themselves.

Figure 1 shows the mapping of the WAM registers into the MC68020. Since the
number of programmer visible registers in the PLM is greater than the number available
in the MC68020, we were forced to put some PLM registers into memory for the MC68020.
Because the MC68020 requires at least three cycles for a memory access, we tried to
assign the least frequently used registers to memory. We also eliminated some PLM
registers by slightly changing the WAM definition, where this was advantageous to the
MC68020. These changes are listed below in detail.

1. We eliminated the HB register, since the net effect of keeping it in memory would
have been the same as accessing it through the B register, which points to the
choice point frame containing HB.

2. According to both Dobry [6] and Mulder and Tick [1], the gain obtained from
environment trimming is negligible. The execution model in [1] does not implement
it; we decided not to implement it either.† In order to achieve this, we changed the
semantics of the allocate instruction so that it has an argument indicating how
much space is to be allocated for the environment. In addition, this eliminated the
need for the N register, which contained the number of permanent variables needed
after returning from a call. We also changed the addressing mechanism for the
permanent variables to a negative offset from the environment pointer instead of a
positive offset. The semantics of allocate N require that the top of the stack be
incremented by N, with the top four locations dedicated to E, CP, B and cut-flag.
Permanent variables are located below these four locations. Note that the variable
number starts at Y4 instead of Y0.

3. We assigned separate unify instructions for read and write modes. This eliminated
the need for a register and a comparison at run time to test the mode. This
technique is based on the fact that once unification proceeds in the write mode, the
following unifications are all in the write mode. This required some changes in the
compiler, but we feel the changes are minor and will not result in a significantly
larger code size.

4. We used the two least significant bits in a data word as the primary tag to
distinguish among reference, constant, list and structure; see figure 2. The next two
least significant bits are used as the secondary tag to distinguish between integer,
atom, and other data types. Note we have used a tagging scheme different from
the one used for the PLM in order to speed up tests for integers and dereferencing.
We used the most significant bit as the cdr bit for cdr-compressed list
representation.
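
The word layout described in item 4 can be illustrated with a minimal Python sketch. Only the field positions (two low bits for the primary tag, the next two bits for the secondary tag, the most significant bit for the cdr bit) come from the text. The numeric tag values, the 32-bit word width, the placement of constant payloads above bit 3, and all function names are assumptions made for illustration, since figure 2 is not reproduced here.

```python
WORD_BITS = 32                      # assumed word width
PRIMARY_MASK = 0b11                 # two least significant bits
SECONDARY_SHIFT = 2
SECONDARY_MASK = 0b11 << SECONDARY_SHIFT   # next two least significant bits
CDR_BIT = 1 << (WORD_BITS - 1)      # most significant bit

# Hypothetical tag values; the report's actual encoding is in its figure 2.
REF, CONST, LIST, STRUCT = range(4)
INTEGER, ATOM, OTHER = range(3)


def primary_tag(word):
    """Return the primary tag held in the two low bits."""
    return word & PRIMARY_MASK


def secondary_tag(word):
    """Return the secondary tag held in bits 2 and 3."""
    return (word & SECONDARY_MASK) >> SECONDARY_SHIFT


def has_cdr_bit(word):
    """Test the cdr bit used for cdr-compressed lists."""
    return bool(word & CDR_BIT)


def make_pointer(address, tag, cdr=False):
    """Tag a word-aligned address (assumed layout: tag in the free low bits)."""
    assert address % 4 == 0 and 0 <= address < CDR_BIT
    return address | tag | (CDR_BIT if cdr else 0)


def pointer_address(word):
    """Strip the primary tag and the cdr bit to recover the address."""
    return word & ~(PRIMARY_MASK | CDR_BIT) & 0xFFFFFFFF


def make_constant(value, secondary):
    """Build a constant word (assumed layout: payload above the tag bits)."""
    assert 0 <= value < (1 << 27)
    return (value << 4) | (secondary << SECONDARY_SHIFT) | CONST


def constant_value(word):
    """Recover the payload of a constant word."""
    return (word & ~CDR_BIT) >> 4
```

With low-bit tags, a test for a given type reduces to a mask and a compare, which is in keeping with the stated aim of speeding up integer tests and dereferencing.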


† By eliminating environment trimming, we improve the performance of the MC68020. However,
this improvement represents a savings of fewer than 1% of the total cycles needed to execute the
benchmarks.


3.  Experimental  Methodology. 


3.1.  Basic  Measurements. 

Our objective was to compare the performance of a special purpose processor (the
VLSI-PLM) with that of a general purpose processor (MC68020) on a set of Prolog
benchmarks. First, using our Prolog compiler [5], we compiled all benchmarks into
modified WAM (i.e., PLM) code. For the MC68020, we then macro-expanded all PLM
instructions and escapes into MC68020 machine code. We attempted only limited
optimization of the resulting 68020 instructions. For the VLSI-PLM, PLM instructions
execute directly in microcode, while those escapes needed by the benchmarks (with the
exception of mod and div) generate call instructions to library routines which themselves
are made up of VLSI-PLM instructions.

We used the VLSI-PLM simulator [7,8] to obtain the frequencies of each PLM
instruction, as well as the frequencies of failure, dereference, trail, and decdr. For each
PLM instruction, we also obtained the frequencies of different execution paths.

We used the Motorola User's Manual [4] to generate a timing table consisting of the
number of MC68020 cycles required to execute each PLM instruction, according to the best,
cache and worst case situations for executing each MC68020 instruction. These three
situations correspond, respectively, to whether both MC68020 instruction overlap is
possible and the instruction access hits in the 256 byte on-chip instruction cache, whether
instruction overlap is not possible but the instruction access hits in the on-chip cache, or
whether neither is possible. Appendix A contains this cycle count information. Since
the execution time of some PLM instructions is very data dependent, we also included
separate entries for each subcase.

To calculate the number of cycles the MC68020 would take to execute each
benchmark program, we multiplied the number of occurrences of each PLM instruction by its
corresponding entry in the timing table. The number of cycles the VLSI-PLM would take to
execute each benchmark was obtained by running the benchmark on the VLSI-PLM simulator.

Table 1 (see Appendix B) reports the ratio of machine cycles required by the
MC68020 and by the PLM to execute each of the fourteen benchmarks. Three entries
are reported for each benchmark, reflecting the best case, cache case, and worst case
described above. We also included in Table 1, for comparison purposes, the cycle counts
for the three benchmarks reported in [1].
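
The MC68020 cycle calculation described above is a frequency-weighted sum over the timing table. The following Python sketch shows the idea under stated assumptions: the instruction names, occurrence counts and per-instruction cycle figures are placeholders chosen for illustration, not data from this study, and the function name is hypothetical.

```python
CASES = ("best", "cache", "worst")


def total_cycles(instruction_counts, timing_table):
    """Sum count * cycles over all PLM instructions, once per timing case."""
    totals = {case: 0 for case in CASES}
    for name, count in instruction_counts.items():
        for case in CASES:
            totals[case] += count * timing_table[name][case]
    return totals


# Placeholder timing entries (cycles per PLM instruction in each case).
timing_table = {
    "instr_a": {"best": 0, "cache": 2, "worst": 3},
    "instr_b": {"best": 8, "cache": 16, "worst": 26},
}

# Placeholder occurrence counts, as a simulator run might report them.
instruction_counts = {"instr_a": 100, "instr_b": 10}

print(total_cycles(instruction_counts, timing_table))
```

In the study itself, data-dependent instructions have separate entries for each subcase, so the keys of such a table would be (instruction, execution path) pairs rather than bare instruction names.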

To calculate execution times for the VLSI-PLM and for the MC68020 chip, we
assumed cycle times as follows: For the VLSI-PLM, we assumed two clock rates, 10 MHz
and 16.7 MHz. For the MC68020, we assumed clock rates of 10 MHz, 16.7 MHz, 25 MHz,
30 MHz and 40 MHz. We assumed that the memory system could respond within one
cycle for the PLM and within three cycles for the MC68020, since the MC68020 requires
at least three cycles for a memory access.

Table 2 shows the relative performance (i.e., the reciprocal of execution time) of the
various frequency MC68020s, normalized to the 10 MHz PLM. Table 3 shows the
equivalent relative performance of the various frequency MC68020s, normalized to the
16.7 MHz PLM. Table 3 has been included in this report because our current
understanding of microarchitecture for Prolog (cf. section 4.1) suggests that a 16.7 MHz
PLM is not a difficult challenge.
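
The normalization used in Tables 2 and 3 can be sketched as follows, assuming execution time is simply the cycle count divided by the clock rate. The function names and the cycle counts in the example are hypothetical and are not taken from the tables.

```python
def execution_time_us(cycles, clock_mhz):
    """Execution time in microseconds for a cycle count at a given clock rate."""
    return cycles / clock_mhz


def relative_performance(cycles, clock_mhz, ref_cycles, ref_clock_mhz):
    """Performance (reciprocal of time) normalized to a reference machine.

    A value below 1.0 means the machine is slower than the reference.
    """
    ref_time = execution_time_us(ref_cycles, ref_clock_mhz)
    return ref_time / execution_time_us(cycles, clock_mhz)


# Hypothetical benchmark: 6000 cycles on a 30 MHz MC68020,
# 1000 cycles on the 10 MHz PLM used as the reference.
print(relative_performance(6000, 30.0, 1000, 10.0))  # 0.5
```

Note that a higher clock rate alone does not settle the comparison: in this made-up example the MC68020 runs at three times the clock rate but needs six times the cycles, so it reaches half the reference performance.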

3.2.  Calibration. 

The calculations reported in Tables 2 and 3 describe very different levels of
performance for the MC68020, depending on which of the three cases (best, cache, or worst)
we choose to believe is most nearly correct. To calibrate our calculated data, we ran
several of the benchmarks on a SUN 3/260, operating essentially stand-alone, with a
very large cache. Assuming a 100% cache hit ratio, and no wasted cycles due to
memory delays, the measured data on the SUN 3/260 should correspond very nearly to
the calculated results. Table 4 reports the calculated cycles vs. the real cycles for six of
the benchmarks. It appears that the real execution time is somewhere between the
cache case and the worst case.

4. Analysis of the Results, and some Observations.

4.1.  Microarchitectures  for  Prolog. 

The VLSI-PLM represents one implementation in a sequence of designs, not the
end result. That is, although the results of Tables 2 and 3 are quite respectable, it is
important to keep in mind that the VLSI-PLM in no way sets a limit on the
performance that can be obtained with special purpose Prolog processors.

Our first implementation, the Berkeley PLM, was designed in 1983 and 1984 [2,3].
As our work has progressed, our understanding of Prolog has increased. This increased
understanding is reflected in the VLSI-PLM, for example, in the execution of built-ins
as library routines consisting of instructions from the base PLM machine. This
technique, sometimes referred to as "millicode," is the mechanism Digital
Equipment Corporation used to implement the full VAX architecture on the microVAX
II chip.

Furthermore, one of our current designs, PUP (parallel unification of Prolog), is
already in the pipeline. Based on a full register transfer level simulator written in N2,
it achieves about twice the performance of the VLSI-PLM by concurrently
processing WAM instructions by means of multiple function units [9].

Finally, two other things should improve the performance of a single chip Prolog
processor relative to a general purpose MC68020: tuning the VLSI circuitry and tuning
the microarchitecture. With respect to circuits, as advanced VLSI technology becomes
more readily available, the wide disparity between the degree to which one can tune a
special purpose circuit and the degree to which one can tune a high performance general
purpose part should diminish. With respect to microarchitecture, the VLSI-PLM has
not yet been aggressively pipelined. Continued attention to critical path design should
produce an improved cycle time. The 16.7 MHz clock suggested in Table 3 is a reflection
of that.

4.2. Degradation due to unoptimized MC68020 code.

We were concerned, in fairness to advocates of off-the-shelf hardware, coupled
usually with a reliance on optimizing compilers, that our MC68020 target machine code
suffers from a fairly pedestrian hand-compilation from Prolog via WAM intermediate
code. We see a developing industry, represented by Quintus and BIM, for example,
which suggests that optimal execution of Prolog on off-the-shelf processors will have the
advantage of heavily optimized compilations.

To  measure  the  degradation  due  to  our  straightforward  macro  expansion,  we  tested 
six  of  the  benchmarks  on  our  Tower  workstation  (a  16.7  MHz  68020).  We  compared  the 
number  of  cycles  required  by  our  macroexpansion  method  with  the  number  of  cycles 
required  by  the  Quintus  system  (release  1.5,  running  under  UTS  V).  Table  4  shows  the 
number  of  cycles  required  to  execute  each  of  six  Prolog  benchmarks  by  the  two  methods. 
Over  the  six  benchmarks,  the  Quintus  system  outperformed  the  macroexpanded  version 
on five of them. We draw no strong conclusions from so little data, other than
to say that it appears our macroexpanded code does not seriously skew our comparisons.

4.3.  Code  Explosion. 

Another  concern  that  should  not  be  minimized  in  any  comparative  study  of  special 
purpose  vs.  general  purpose  processors  is  the  code  explosion  problem.  In  most  cases,  we 
have  found  that  compiling  to  a  lower  level  architectural  interface  results  in  a  significant 
increase  in  the  code  size  [10].  Table  5  shows  the  relative  code  sizes  of  the  PLM  and  the 
macro  expanded  MC68020  over  eight  of  the  14  benchmarks.  This  code  explosion,  more 
than  15  to  1  in  many  cases,  does  degrade  performance  with  respect  to  memory 
bandwidth  and  cache  hit  ratio.  The  simplistic  answer  of  a  larger  cache  is  unacceptable 
since it is a fact that larger caches are slower caches. Although the benchmarks of this
study fit well within the limits of the cache, Prolog applications of the future will not be
able to.


4.4. The Significance Method used by Mulder and Tick.

We  were  concerned  that  the  significance  method  used  by  Mulder  and  Tick  could 
have  created  the  appearance  of  convergence,  when  in  reality,  the  "next"  non- 
implemented  PLM  construct  would  have  caused  divergence.  This  certainly  could  have 
been  the  case.  It  turned  out,  however,  that  the  results  obtained  by  carrying  out  the 
macroexpansions  for  all  45  PLM  constructs  mirrored  those  obtained  by  stopping  after 
the  first  14.  The  first  entries  in  Table  1  show  the  results  for  the  benchmarks  chat  and 
boyer  obtained  via  the  significance  method  (Mulder-Tick)  and  via  macroexpanding  all  45 
constructs (Patt-Chen). The differences in the numbers reported in Table 1 are more
likely to be due to differences between the two execution models (i.e., plml vs.
VLSI-PLM) than to any differences resulting from the significance method.

4.5.  Best,  Cache,  and  Worst  Cases. 

The Motorola User's Manual gives timing information for the following situations.
The best case is when the instruction is in the on-chip instruction cache and maximum
overlap is achieved. The cache case is when the instruction is in the cache but there is
minimum overlap between instructions. The worst case is when the instruction is not in
the cache and there is no overlap between instructions. We summed all cycles according
to the three cases. For the best case, we can safely conclude that it is a very optimistic
upper bound on performance, not something that is really achievable. For example, the
code sequence to macroexpand the switch_on_tag routine (reproduced as figure 3), which
forms the core of the switch instructions, yields a best case of eight cycles. However, a
small fine-tuned measurement demonstrated that it cannot execute in less than 28
cycles.

More importantly, recall the data of Table 4. We conclude that the real
performance of the MC68020 is probably somewhere between the cache case and the worst
case.

4.6. Memory Cycle Differences.

The basic memory access protocols for the MC68020 and the PLM are different.
The PLM is designed with a cache system and a write buffer, enabling the memory
system to respond in one cycle. The MC68020, on the other hand, uses at least 3 cycles for
a memory access. This difference between memory access protocols for the MC68020 and
the PLM is very important to the relative performance of the two machines for the
following reasons. First, since the MC68020 code size is much greater, and its on-chip cache
consists of only 256 bytes, it needs to make many more instruction fetches than the PLM.
Also, like most AI software, Prolog programs tend to be memory intensive (i.e., more
than three data accesses per PLM instruction is not uncommon). Both situations result
in the MC68020 spending much more time in memory accesses due to its three-cycle
protocol.
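
As a rough illustration of the protocol difference, the following Python sketch computes the minimum cycles spent on data accesses alone for one PLM instruction. The one-cycle and three-cycle access costs and the figure of three data accesses per instruction come from the text; the function and constant names are hypothetical, and instruction fetches are ignored.

```python
PLM_CYCLES_PER_ACCESS = 1           # cache plus write buffer respond in one cycle
MC68020_MIN_CYCLES_PER_ACCESS = 3   # at least three cycles per memory access


def memory_cycles(accesses, cycles_per_access):
    """Lower bound on cycles spent on a given number of memory accesses."""
    return accesses * cycles_per_access


data_accesses = 3  # "more than three ... is not uncommon" per PLM instruction

plm = memory_cycles(data_accesses, PLM_CYCLES_PER_ACCESS)
mc68020 = memory_cycles(data_accesses, MC68020_MIN_CYCLES_PER_ACCESS)
print(plm, mc68020)  # 3 9
```

The extra instruction fetches caused by the larger MC68020 code size add to this gap; they are not modeled here.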


4.7.  Cdr-compressed  List  Representation. 

The use of cdr-compressed list representation has received substantial treatment in
the Prolog literature. The VLSI-PLM uses it and has hardware support for checking the
cdr bit. The MC68020 has no such hardware support, and so incurs a penalty each time
it needs to check the cdr bit. Dobry [2] observed that lists tend to be non-contiguous
after some manipulation. For such lists, the VLSI-PLM suffers little penalty compared
with the cdr-compressed case. One can argue legitimately that the use of
cdr-compressed lists unfairly penalizes the MC68020. On the other hand,
cdr-compressed lists save memory accesses, and as we have already discussed, this is a
major bottleneck in MC68020 performance. On balance, the use of cdr-compressed lists
probably hurts the MC68020, although how much so is not clear. The extra memory
access is coupled with simpler (smaller in size and fewer branches) instructions. In this
study, we elected to use the cdr-compressed list representation to maintain consistency
across the Prolog benchmarks, although we recognize that by so doing, we may have
penalized the MC68020 unfairly, by an estimated 10% in the case of the
chat benchmark and around 5% in the case of boyer.

4.8. Hardware Support for Tags.

The VLSI-PLM has an advantage over the MC68020 for instructions involving tag
manipulation, for example, dereference and general unification. This is mainly due to
three factors: (1) The PLM has a powerful multiway branch mechanism based on tags.
(2) The PLM can do tag manipulation on programmer invisible registers. (3) The
MC68020 suffers from the separation of its address and data registers, i.e., the data
registers cannot be used to address memory directly and the address registers cannot
be used for tag manipulation.

However, the PLM does not have as great an advantage for instructions which do
not require tag manipulation. These instructions include those which manipulate choice
points, put instructions, and write-mode unification instructions. Unfortunately for the
PLM, these instructions occur almost as frequently as those that require tag
manipulation. For example, for the relational operator built-ins, when the PLM has to go off-chip
and it cannot utilize its tag manipulation capability, the VLSI-PLM needs 54 cycles
while the MC68020 needs 17 cycles in the best case, 78 cycles in the cache case, and 99
cycles in the worst case. At best, this produces a cycle count advantage for the
VLSI-PLM of 1.8 to 1.
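
The 1.8 to 1 figure follows from the cycle counts quoted above. A short Python check, using only those quoted counts (the variable and function names are hypothetical):

```python
VLSI_PLM_CYCLES = 54
MC68020_CYCLES = {"best": 17, "cache": 78, "worst": 99}


def cycle_ratio(mc68020_cycles, plm_cycles):
    """MC68020 cycles per VLSI-PLM cycle; above 1.0 favors the VLSI-PLM."""
    return mc68020_cycles / plm_cycles


ratios = {case: cycle_ratio(c, VLSI_PLM_CYCLES) for case, c in MC68020_CYCLES.items()}
for case, ratio in ratios.items():
    print(f"{case}: {ratio:.2f}")
```

The worst case gives 99/54, about 1.8, which is the most favorable figure for the VLSI-PLM. With the quoted best-case count of 17 cycles the ratio falls below 1, so on that assumption the MC68020 needs fewer cycles for these built-ins.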

5.  Concluding  Remarks. 

This report has attempted to continue in the same vein as the work of Mulder
and Tick [1] and further compare the performance of a special purpose processor and a
general purpose processor with respect to the execution of Prolog benchmarks. For the
general purpose processor, we continued with Mulder and Tick's choice of the MC68020.
For the special purpose processor, we chose our most recent implementation, the
VLSI-PLM, which is currently in fabrication.

A fair comparison is fraught with obstacles. There is no existing VLSI-PLM
system, so we cannot simply run the benchmarks on both systems. On the other hand, we
do not have a comprehensive MC68020 simulator, so we cannot simply count sanitized
cycles in both cases.

Our method was to first compile Prolog benchmarks to modified WAM code. After
that, in one case we executed the WAM code on the VLSI-PLM simulator provided by
[7], and in the other case, macroexpanded the WAM code into MC68020 code and
counted the number of cycles it would take to execute. To calibrate the 68020
calculations, six of the 14 benchmarks were executed on a very lightly loaded SUN 3/260
system containing enough cache memory (64 KB) to reasonably guarantee there would be
no wait states for memory.

Although several problems with this study exist, some general statements are
possible as delineated in section 4. The most relevant performance figures are those
contained in Table 3, which assumes a 16.7 MHz VLSI-PLM. If we assume approximately
worst case 68020 behavior, suggested by Table 4, and a 30 MHz 68020, we see about a
factor of between three and four in performance in favor of the VLSI-PLM. If we add to
this an improved VLSI technology implementation and the effects of significant code
explosion (15 to 1 not uncommon, see Table 5), the performance ratio can be even
greater.

Acknowledgement. 

This report, like most of the work done on the Aquarius project, draws on the
knowledge and insights of many members of our research team, rather than simply those
of the authors of this report. It is our pleasure to gratefully acknowledge the help of
Bruce Holmer, Vason Srini, and Alvin Despain. Financial support to carry out this work
came from a number of sources, including NCR Corporation, Digital Equipment
Corporation, and the California Micro program. Part of this work was sponsored by
Defense Advanced Research Projects Agency (DoD), Arpa Order No. 4871, Monitored by
Space & Naval Systems Warfare Command under Contract No. N00039-84-C-0089.

References. 

[1] Mulder, H., Tick, E. A Performance Comparison between PLM and an MC68020
Prolog Processor, Technical Note No. CSL-86-302. (Sept. 1986).

[2] Dobry, T., Patt, Y. and Despain, A., Design Decisions Influencing the
Microarchitecture for a Prolog Machine, Conf. Proc., 17th Annual International Workshop on
Microprogramming, (October 1984).

[3] Dobry, T., Despain, A. and Patt, Y., Performance Studies of a Prolog Machine
Architecture, Conf. Proc., 12th Annual International Symposium on Computer
Architecture, pp. 180-190, (June 1985).

[4] Motorola. MC68020, 32-Bit Microprocessor User's Manual.

[5] VanRoy, P., A Prolog Compiler for the PLM. Master Thesis, U.C. Berkeley. (August,
1984).

[6] Dobry, T. A High Performance Architecture for Prolog, Ph.D. dissertation. Report
No. UCB/CSD 87/352. (May 1987).

[7] Holmer, B. VLSI-PLM Simulator and User Manual, in preparation.

[8] Srini, V.P., Tam, J.V., Nguyen, T.M., Patt, Y.N., Despain, A.M., Moll, M., and
Ellsworth, D., A CMOS Chip for a Prolog Processor, Proceedings of ICCD 1987, New
York, (October, 1987).

[9] Chen, C., Singhal, A. and Patt, Y., PUP: An Architecture to Exploit Parallel
Unification in Prolog, unpublished report (October, 198f).

[10] Patt, Y., Several Implementations of Prolog. IEEE Transactions on Systems, Man,
Cybernetics, (accepted for publication, Spring 1988).


Figure  1.  Register  Mapping  between  PLM  and  MC68020 
Figure 2. Data Encoding Format for MC68020
Figure 3. switch_tag Macro and Timing

Appendix A. Macroexpanded WAM Constructs and Timing Information.

A.1. Macro Expansions

A.2. Get Instructions


get_constant_X

input:     constant C and argument register number Xi
output:
function:  Unify C with Xi

Instruction                      best          cache         worst
get_constant_X:
  move.l  #C,D0                  0 (0/0/0)     6 (0/0/0)     5 (0/1/0)
  move.l  Xi,D1                  0 (0/0/0)     2 (0/0/0)     3 (0/1/0)
  move.l  PDL,pdlBase(MP)        5 (0/0/1)     6 (0/0/1)     7 (0/1/1)
  bra     unify1                 3 (0/0/1)     7 (0/0/1)     13 (0/2/1)
cycles                           8+u-4 (2)     20+u-10 (2)   33+u-12 (7)

Xi in memory:
  move.l  Xi(MP),D1              3 (1/0/0)     7 (1/0/0)     9 (1/1/0)
                                 3 (1) more    6 (1) more    6 (2) more

get_value_X

input:     input arguments Xi and Xj
output:
function:  unify Xi with Xj

Instruction                      best          cache         worst
get_value_X:
  move.l  Xi,D0                  0 (0/0/0)     2 (0/0/0)     3 (0/1/0)
  move.l  Xj,D1                  0 (0/0/0)     2 (0/0/0)     3 (0/1/0)
  move.l  PDL,pdlBase(MP)        3 (0/0/1)     5 (0/0/1)     7 (0/1/1)
  bra     unify                  5 (0/0/1)     7 (0/0/1)     13 (0/2/1)
cycles                           8+u* (2)      16+u* (2)     26+u* (7)
X in mem                         3 (1) or 6 (2)   7 (1) or 14 (2)   8 (1) or 18 (2)

* u is the time spent in the unification routine

get_value_Y

input:     Permanent variable Yi and argument Xj
output:
function:  unify Yi and Xj

Instruction                      best          cache         worst
get_value_Y:
  move.l  -Yi(E),D0              3 (1/0/0)     7 (1/0/0)     9 (1/1/0)
  move.l  Xj,D1                  0 (0/0/0)     2 (0/0/0)     3 (0/1/0)
  move.l  PDL,pdlBase(MP)        3 (0/0/1)     5 (0/0/1)     7 (0/1/1)
  bra     unify                  3 (0/0/1)     7 (0/0/1)     13 (0/2/1)
cycles                           11+u* (3)     21+u* (3)     32+u* (8)
Xi in memory                     3 (1) more    7 (1) more    8 (1) more

* u is the time needed by the unification subroutine

get_variable_X

input:     argument registers Xi and Xj
output:
function:  Move the content of Xj to Xi

Instruction                      best          cache         worst
get_variable_X:
  move.l  Xj,Xi                  0 (0/0/0)     2 (0/0/0)     3 (0/1/0)
cycles                           0 (0)         2 (0)         3 (1)

get_variable_X

input:     argument registers Xi and Xj
output:
function:  Xi is in memory rather than a machine register

Instruction                      best          cache         worst
get_variable_X:
  move.l  Xj,Xi(MP)              3 (0/0/1)     5 (0/0/1)     7 (0/1/1)
cycles                           3 (1)         5 (1)         7 (2)

get_variable_X

input:     argument registers Xi and Xj
output:
function:  Xj is in memory

Instruction                      best          cache         worst
get_variable_X:
  move.l  Xj(MP),Xi              3 (1/0/0)     7 (1/0/0)     9 (1/2/0)
cycles                           3 (1)         7 (1)         9 (3)

get_variable_X

input:      argument registers Xi and Xj

output:

function:   Both Xi and Xj in memory


gft_Ttritble_Y 


input: 

_ _ _ _  — 

Pennennnt  vnrinble  Yi  nnd  »riuin«n‘  register  Xj 

output: 

function 

Xi  it  in  memory 

Instruction 

get_v»ri»ble_Y: 

best 

Vi(MPl.-fi(E)  6  (1/0/1) 

1  (1/0/1) 

IS  (1/2/1) 

1  cycles 

6(2) 

«  (2) 

18(4) 

get_list

input:      argument register Xi

output:

function:   set up for unification of Xi and a list


best 


encbe 


eont 


getjist  Xi: 
dtrtfertnct 
Lref: 
move! 
or  b 
btst 
beq 
bset 
1; 

move  1 

trail 

move  1 

brn 

Lnref: 

move.) 

bftsl 

ble 

bclr 

move!  I 
rend.upify. 


(Xi,  Lref,  Lnref) 
H,D0 

^listtig.DO 

#cdr,Xi 

1 

#cdr,D0 

DO, (AO) 

Xi,A0 

D0,Xi 

write_nnify 

Dx.DO 
DO  [0,21 
Jnil 
#0,D0 
D0,S 


cycles 


reference 

list 

others 

Xt  in  memory 


5(0) 

12(0) 

17  (5) 

0 (0/0/0) 

2 (0/0/0) 

3 (0/1/0) 

0 (0/0/0) 

4  (0/0/0) 

6  (0/2/0) 

1 (0/0/0) 

4  (0/0/0) 

5 (0/1/0) 

1/3 

4/6 

5/0 

1 (0/0/0) 

4  (0/0/0) 

5 (0/1/0) 

3  (0/0/1) 

4  (0/0/1) 

5  (0/1/1) 

4  (0) 

20  (0) 

28(7) 

0 (0/0/0) 

2  (0/0/0) 

3  (0/1/0) 

3 

6 

9 

6(0) 

12(0) 

16(3) 

0 (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

3 (0/0/0) 

6  (0/0/0) 

7 (0/1/0) 

1/3 

4/6 

5/9 

1  (0/0/0) 

4  (0/0/0) 

5  (0/1/0) 

0  (0/0/0) 

2  (0/0/0) 

3 (0/1/0) 

19+  d+  t  (1) 

62+  d+t  (1) 

86+  d+  t  (2 

11  +  d  (0) 

30+  d  (0) 

39+ d  (8)  • 

12+  d  (0) 

26+  d  (0) 

32+ d  (7)  • 

3(1)  more 

0  (1)  more 

12  more  (4) 

dereference and trail are to be macroexpanded

* d is the time needed to dereference and t is the time needed to do trail checking
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get_structure

input:      Xi
output:
function:   set up for the unification of Xi and a structure

Instruction                          best            cache            worst

get_structure:
    dereference (Xi, Lref, Lnref)
Lref:                                5 (0)           12 (0)           17 (5)
    move.l  H,D0                     0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    or.b    #structtag,D0            0 (0/0/0)       4 (0/0/0)        6 (0/2/0)
    btst    #cdr,Xi                  1 (0/0/0)       4 (0/0/0)        5 (0/1/0)
    beq     1                        1/3             4/6              5/9
    bset    #cdr,D0                  1 (0/0/0)       4 (0/0/0)        5 (0/1/0)
1:
    move.l  D0,(A0)                  3 (0/0/1)       4 (0/0/1)        5 (0/1/1)
    trail   Xi,A0                    4 (0)           20 (0)           28 (7)
    move.l  D0,Xi                    0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    move.l  #Functor,(H)+            4 (0/0/1)       8 (0/0/1)        7 (0/1/1)
    bra     write_unify_             3               6                9
Lnref:
    test_struct (D0, _fail)          1/3 (0)         16/18 (0)        23/27 (7)
    and.b   #0xfc,D0                 0 (0/0/0)       4 (0/0/0)        6 (0/2/0)
    movea.l D0,S                     0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    cmpi.l  #F,(S)+                  0+5 (1/0/0)     2+8 (1/0/0)      3+9 (1/1/0)
    bne     _fail                    1/3             4/6              5/9
    read_unify_

cycles *
reference                            23+d+t (2)      70+d+t (2)       93+d+t (25)
structure                            13+d (1)        48+d (1)         65+d (16)
list or const                        9+d (0)         30+d (0)         43+d (11)
functor not match                    15+d (1)        50+d (1)         69+d (17)
Xi in memory                         3 (1) more      7 (1) more       9 (4) more

dereference, trail and test_struct are to be macroexpanded

* d: time needed for dereference, and t: time needed to trail the bindings
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4. 

A.3.  Put  Instructions


put_variable_X

input:      argument register Xi

output:

function:   Create an unbound variable on the heap and put the pointer in Xi

loftructioD 

put_v»ri»bl«_JC: 

move! 

H.Xi 

best 

0  (0/0/0) 

4  (0/0/1) 

cache 

2  (0/0/0) 

4  (0/0/1) 

S (0/1/0) 

6  (0/1/1) 

cycles 

Ml) 

6(1) 

«(3) 

put_variable_Y

input:      Permanent variable Yi
output:
function:   Create an unbound permanent variable Yi

Instruction                          best            cache            worst

put_variable_Y:
    lea     -4*i(E),A0               4 (0/0/0)       4 (0/0/0)        6 (0/2/0)
    move.l  A0,(A0)                  3 (0/0/1)       4 (0/0/1)        5 (0/1/1)
    move.l  A0,Xj                    0 (0/0/0)       2 (0/0/0)        3 (0/1/0)

cycles
Xi in register                       7 (1)           10 (1)           14 (5)

put_value_X

input:      argument Xi and Xj
output:
function:   move the content Xi into Xj

Instruction                          best            cache            worst

put_value_X:
    move.l  Xi,Xj                    0 (0/0/0)       2 (0/0/0)        3 (0/1/0)

cycles 

Xi.Xj  in  registers 

Xi  in  memory 

Xj  in  memory 
both  Xi.Xj  in  memory 

0(0) 

3(1) 

3  (1) 

6(2) 

2(0) 

7(1) 

S(l) 

«  (2) 

3(1) 

0(2) 

7(2) 

13(3) 

This instruction is the same as get_value_X, except the direction
of data movement is reversed.

put_value_Y

input:      permanent variable Yi and argument register Xj
output:
function:   Move the content of Yi into Xj

Instruction                          best            cache            worst

put_value_Y:
    move.l  -4*i(E),Xj               3 (1/0/0)       7 (1/0/0)        8 (1/1/0)

cycles
Xj in register                       3 (1)           7 (1)            8 (2)
Xj in memory                         6 (2)           8 (2)            13 (3)


put_constant

input:      Constant C and argument register Xi
output:
function:   move C into Xi

Instruction                          best            cache            worst

put_constant:
    move.l  #Constant,Xi             0 (0/0/0)       6 (0/0/0)        5 (0/1/0)

cycles
Xi in register
Xi in memory

put_list

input:      Argument Xi
output:
function:   initialize Xi to point to the heap where the list is built

Instruction                          best            cache            worst

put_list:
    move.l  H,Xi                     0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    or.b    #listtag,Xi              0 (0/0/0)       4 (0/0/0)        6 (0/2/0)
    write_unify_

cycles
Xi in register                       0 (0)           6 (0)            9 (3)
Xi in memory                         3 (1)           11 (1)           16 (4)

put_structure

input:      Functor F and Argument Xi

output:

function:   initialize Xi to point to the heap where a structure with
            functor F is built


put_unsafe_value

input:      Permanent variable Yi and argument register Xj (on chip)

output:

function:   If Yi is dereferenced to an unbound variable, a new variable
            is created on the heap and is put into Xj. Otherwise, the
            dereferenced value of Yi is put into Xj.

inftrnctioD 

best 

cache 

worst 

pnt_uns»f(_v»luf 

move.l 

icTtferenct 

Lref: 
cmp.t 
bgt 
trail 
move  1 
move  1 

Lnref 

-t*i(E),Xi 
(Xi,  Lref,  Lnref) 

#sUck_bue,Xi 

Lnref 

Xi,A0 

H,Xi 

H,(H)+ 

3  (1/0/0) 

5(0) 

0  (1/0/0) 

1/3 

4(0) 

0 (0/0/0) 

4  (0/0/1) 
6(0) 

7 (1/0/0) 

12  (0) 

2+  4  (1/0/0) 

4/6 

20(0) 

2 (0/0/0) 

4 (0/0/1) 

12(0) 

8 (1/2/0) 

17(5) 

S+ 5  (1/2/0) 

5/8 

28  (7) 

3 (0/1/0) 

5  (0/1/1) 

16(3) 

cycles 

Xi  reg  not  ref 

8+d  (1) 

18+  d  (1) 

25+  d  (6) 

ref  no  move 

11+ d  (2) 

31+  d  (2) 

43+  d  (13) 

ref  move 

17+ d  (3) 

55+  d  (3) 

75+  d  (22) 

Xi is in a machine register on chip

put_unsafe_value

input:      Permanent variable Yi and argument register Xj (in memory)

output:

function:   If Yi is dereferenced to an unbound variable, a new variable
            is created on the heap and is put into Xj. Otherwise, the
            dereferenced value of Yi is put into Xj.

Instruction 

best  cache  worst 

put_unsafe_value 
move  1 
dereference 

Lref: 
emp  1 
bgt 
trail 
move  1 
move  1 

Lnref: 
move  1 

-4*i(E),D0 
(DO,  Lref,  Lnref) 

#stack_base,DO 

Lnref 

DO.AO 

H.DO 

H,(H)+ 

DO.Xi(MP) 

3 (1/0/0) 

5(0) 

0 (1/0/0) 
1/3 

4(0) 

0 (0/0/0) 

4  (0/0/1) 
6(0) 

3  (0/0/1) 

7 (1/0/0) 

12(0) 

2+4  (1/0/0) 
4/6 

20  (0) 

2  (0/0/0) 

4  (0/0/1) 

12(0) 

5  (0/0/1) 

8 (1/2/0) 

17(5) 

3+  5  (1/2/0) 

5/8 

28  (7) 

3  (0,1/0) 

5 (0/1/1) 

16(3) 

7  (0,'1/1) 

Xi  in  mem  nref 

QQII 

32+  d  (8) 

ref  no  move 

50+  d  (15) 

ref  move 

20+ d  (4) 

60+  d  (4) 

82+d(24) 
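The function description of put_unsafe_value can be sketched at the level of data movement. This is a minimal illustration and not the macro expansion listed above: the cell representation (`REF`/`CONST` tuples), the `Store` class, and the starting heap address are all assumptions made for the example. The trail operation and the stack_base comparison of the listing are left out, so the sketch always creates a new heap variable when the dereferenced value is unbound.

```python
from dataclasses import dataclass, field

REF, CONST = "ref", "const"   # hypothetical tags for this sketch


@dataclass
class Store:
    cells: dict = field(default_factory=dict)  # address -> (tag, value)
    h: int = 1000                              # assumed heap-top pointer H

    def new_heap_var(self):
        """Create an unbound variable (a self-reference) at H and bump H."""
        addr = self.h
        self.cells[addr] = (REF, addr)
        self.h += 1
        return addr


def dereference(store, cell):
    """Follow reference chains until an unbound variable or a non-reference."""
    tag, val = cell
    while tag == REF:
        nxt = store.cells[val]
        if nxt == (REF, val):      # points to itself: unbound variable
            return (REF, val)
        tag, val = nxt
    return (tag, val)


def put_unsafe_value(store, yi_addr):
    """Return the value to place in Xj for the permanent variable at yi_addr."""
    tag, val = dereference(store, store.cells[yi_addr])
    if tag == REF:
        # Unbound: create a new heap variable and bind the old one to it.
        new = store.new_heap_var()
        store.cells[val] = (REF, new)
        return (REF, new)
    return (tag, val)              # bound: pass the dereferenced value


if __name__ == "__main__":
    s = Store(cells={1: (REF, 1)})
    print(put_unsafe_value(s, 1), s.cells)
```

In the timing tables the two outcomes correspond to the "not ref" row and the "ref" rows; the distinction between "ref no move" and "ref move" depends on the stack_base test that this sketch omits.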

A.4.  General  Unification  Routine 


input:      D0, D1

output:

temps:      D0, D1, S

general unification subroutine

Instruction 

best  cnebe 

seorst 

dereftrenct 

Lref; 

dtrtferenct 

nnify_rr; 

Lrr; 

bclr 

bmi 

bclr 

bmi 

cmp.l 

bit 

1: 

mov»  I 
exg 
2: 

move! 

trail 

rt! 


(DO,  Lref,  Lnref) 
(Dl,  Lit,  Lnir) 


#cdr,D0 

10 

#cdr,Dl 

20 

D0,D1 

2 

DO, AO 
D0,D1 

DO,(AO) 

Dl,AO 


S(0) 


5(0) 

1 (0/0/0) 

1/3 

1 (0/0/0) 

1/3 

0  (0/0/0) 

1/3 


12(0) 


12(0) 

*  (0/0/0) 

4/6 

4  (0/0/0) 

4/6 

2 (0/0/0) 

4/6 


0 (0/0/0)  2  (0/0/0) 
0 (0/0/0)  2  (0/0/0) 


3 (0/0/1) 

4+  t  (0) 

0  (1 '0/0) 


4  (0/0/1) 
20+ 1  (0) 
10  (1/0/0) 


17  (5) 


17(5) 

5  (0/1/0) 
5/9 

5  (0/1/0) 
5/9 

3  (0/1/0) 

5/9 

3  (0/1/0) 

3  (0/1/0) 

5  (0/1/1) 
28+  t  (7) 
12  (1/2/0) 


10 

bclr 

#cdr,Dl 

bmi 

15 

emp  1 

D0,D1 

bit 

2 

exg 

D0,D1 

11: 

bset 

#cdr.D0 

movei  1 

Dl.AO 

bset 

#cdr,Dl 

bra 

O 

IS 

cmp.l 

D0,D1 

bit 

11 

tr 

DO.Dl 

1 (0/0/0) 

4  (0/0/0) 

5  (0/1/0) 

1/3 

4/6 

5/9 

0 (0/0/0) 

2  (0/0/0) 

3  (0/1/0) 

1/3 

4/6 

5/9 

0 (0/0/0) 

2  (0/0/0) 

3 (0/1/0) 

1  (0/0/0) 

4  (0/0/0) 

5  (0/1/0) 

0 (0/0/0) 

2  (0/0/0) 

3 (0/1/0) 

1  (0/0/0) 

4  (0/0/0) 

6 (0/1/0) 

1/3 

4/6 

5/9 

0 (0/0/0) 

4  (0/0/0) 

5  (0/1/0) 

1/3 

4/6 

5/9 

0 (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

worst 




unify_list

input:      D0, D1

output:

temps:      D0, D1, S, A1, A0, PDL

function:   loop to unify lists

Instruction 

best 

cncbe 

wont 

unify  list: 

move  1 

(PDL)+  ,D1 

A  (1/0/0) 

6(1 /0/D) 

7  (1/1/0) 

move  1 

(PDL)+  ,D0 

A  (1/0/0) 

6 (1/0/0) 

7  (1/1/0) 

unifyjist 
add  b 

Oxfc.DO 

O-e  0  (0/0/0) 

2+2  (0/0/0) 

3+  3  (0/1/0) 

ndd  b 

Oxfc.Dl 

O-e  0  (0/0/0) 

2+  2  (0/0/0) 

3+  3  (0/1/0) 

moven  1 

DO,S 

0 (0/0/0) 

2 (0/0/0) 

3 (0/1/0) 

moven  1 

Dl.Al 

0 (0/0/0) 

2  (0/0/0) 

3 (0/1/0) 

move  1 

(S)-t-  ,D0 

A  (1/0/0) 

6  (1/0/0) 

7 (1/1/0) 

move  I 

(Al)+  ,D1 

4  (1/0/0) 

6 (1/0/0) 

7  (1/1/0) 

bra 

Joop 

3 

6 

0 

continue 

move  1 

(SK  .DO 

4 (1/0/0) 

6  (1/0/0) 

7  (1/1/0) 

bmi 

4 

1/3 

4/6 

5/9 

1; 

2(0). 

8(0). 

12(4). 

move! 

(Al)+  ,D1 

4  (1/0/0) 

6 (1/0/0) 

7  (1/1/0) 

bge 

Joop 

1/3 

4/6 

5/9 

iecdr 

Lcr: 

(Dl.Al, Lcr, Lfiil,  Joop) 

5(0) 

28  (0) 

35  (6) 

subq  1 

#fS 

0 (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

move  I 

S,D0 

0  (0/0/0) 

2  (0/0/0) 

3 (0/1/0) 

* 

or.l 

#cdrJisttig,DO 

0+  0  (0/0/0) 

2+4  (0/0/0) 

3+  5  (0/2/0) 

3: 

move  1 

D0,(A1) 

3 (0/0/1) 

4  (0/0/1) 

5 (0/1/1) 

trait 

Dl.Al 

4(0) 

20  (0) 

28(7) 

rts 

9 (1/0/0) 

10  (1/0/0) 

12  (l/2'O) 

itcdr 

LcvO 

(DO,S,LcvO,LcoO,l) 

5(0) 

28  (0) 

35  (6) 

move! 

(Al)+  ,D1 

4  (1/0/0) 

6 (1/0/0) 

7  (1/1/0) 

bmi 

6 

1/3 

4/6 

5/9 

5: 

subq  1 

#<,A1 

0  (0/0/0) 

2 (0/0/0) 

3 (0/1/0) 

move  1 

A1,D0 

0 (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

move!  1 

S.Al 

0 (0/0/0) 

2  (0/0/0) 

3  (0/1/0) 

bra 

2 

3 

6 

9 

iccdr 

Lcvl: 

(Dl,Al,Lcvl,Lcol,Lncl) 

5(0) 

28  (0) 

35  (7) 

cmp.l 

DO.Dl 

0 (0/0/0) 

2 (0/0/0) 

3 (0/1/0) 

bgt 

7 

1/3 

4/6 

5/9 

exg 

D0,D1 

0  (0/0/0) 

2 (0/0/0) 

3 (0/1/0) 

7: 

bclr 

#cdr,Dl 

1  (0/0/0) 

4  (0/0/0) 

5 (0/1/0) 

bset 

#cdr.D0 

1 (0/0/0) 

4 (0/0/0) 

5 (0/1/0) 

moven.I 

Dl.AO 

0  (0/0/0) 

2 (0/0/0) 

3 (0/1/0) 

bra 

3 

3 

6 

9 

anifyjist  (coat ) 


belt 


ucbe 


•oni 


Lncl: 

bclr 

exg 

#cdr,D0 

DO, AO 

2 

2  (0)  «  (0) 

1  (0/0/0)  1 (0/0/0) 

0 (0/0/0)  2  (0/0/0) 

3  6 

12  (4) 

6 (0/1/0) 

3 (0/1/0) 

g 

Lcol: 

ex| 

bclr 

n>ove*.l 

DO.Dl 

#cdr,Dl 

D1,A0 

3 

6  (0)  U  (0) 

0 (0/0/0)  2 (0/0/0) 

1  (0/0/0)  1  (0/0/0) 

0 (0/0/0)  2 (0/0/0) 

3  6 

17  (3) 

3 (0/1/0) 

S (0/1/0) 

3 (0/1/0) 

g 

LcoO. 

movt.l 

bge 

itcdr 

Lcv2 

bclr 

movei.l 

(A1),D1 

Lfail 

(Dl,Al,Lcv2,Lco2,Lfail) 

#cdr,Dl 

Dl,AO 

3 

6  (0)  H  (0) 

3 (1/0/0)  6 (1/0/0) 
1/3  4/6 

5  (0)  28  (0) 

1 (0/0/0)  4  (0/0/0) 

0 (0/0/0)  2  (0/0/0) 

3  6 

17  (3) 

7  (1/1/0) 

s/g 

35(6) 

5 (0/1/0) 

3 (0/1/0) 

g 

Lco2 
cmp  I 
bDe 

DO.Dl 

_UDify_cdr 

6  (0)  14  (0) 

0 (0/0/0)  2 (0/0/0) 
1/3  4/6 

g  (1/0/0)  10(1/0/0) 

17  (3) 

3 (0/1/0) 

S/g 

12  (1/2/0) 

_onify_cdr: 

ltil_0lruel 

bclr 

bclr 

(DO.  Lfail) 

(Dl,  Lfail) 

#cdr,D0 

#cdr.Dl 

nnify_struct 

1/3  16/18 

1/3  16/18 

1 (0/0/0)  4  (0/0/0) 

1 (0/0/0)  4  (0/0/0) 

3  6 

23/27 

23/27 

5 (0/1/0) 

5 (0/1/0) 

g 

Joop 

itrtjtrtnct 

Lv; 

(DO,  Lv,  Lnv) 

5  (0)  12  (0) 

17  (5) 

dtrejtrenct 

Lvov: 

mov»  1 

bclr 

beq 

exg 

move! 

tDove.l 

bra 

(Dl.  Lw,  Lvnv) 

DO.AO 

#cdr,D0 

01 

DO.AO 

D1,(A0) 

DO.Dl 

01 

6  (0)  12  (0) 

0 (0/0/0)  2  (0/0/0) 

1 (0/0/0)  4  (0/0/0) 

1/3  4/6 

0 (0/0/0)  2 (0/0/0) 

3 (0/0/1)  4  (0/0/1) 

0  (0/0/0)  2  (0/0/0) 

3  6 

16  (3) 

3 (0/1/0) 

5 (0/1/0) 

S/g 

3 (0/1/0) 
S (0/1/1) 
3 (0/1/0) 

g 

iBStmCtlOD 


Lw; 

bclr 

bmi 

bclr 

bmi 

cmp.l 

bit 

01. 

movcB.I 

exg 

02 

move  I 

trail 

brx 


10: 
bclr 
bmi 
cmp.l 
bit 
exg 
11: 
beet 
movea.  I 
bset 
bra 


#cdr,D0 

10 

#cdr,Dl 

20 

D0,D1 

02 


noifyjist  (cent 


best 


5(0) 

1  (0/0/0) 

1/3 

1  (0/0/0) 

1/3 

0 (0/0/0) 

1/3 


cache 


12(0) 

*  (0/0/0) 
d/6 

4  (0/0/0) 
d/6 

2 (0/0/0) 
d/6 


17  (5) 

5  (0/1/0) 
5/9 

5 (0/1/0) 

5/9 

3 (0/1/0) 

5/9 


InstructioD 

Lbs: 

bclr 

bclr 
cmp  1 
bcq 

$vitch_lag 

#cdr,D0 

#cdr,DI  1  (0/0/0) 

DO.Dl 

_continue 

(DO,  Lfnil,  LI,  Lfnil,  Li) 

unify  lilt  (cont ) 

beat 

6(0) 

1 (0/0/0) 

4  (0/0/0) 

0  (0/0/0) 

1/3 

8 

cache 

12  (0) 

4 (0/0/0) 

5 (0/1/0) 

2 (0/0/0) 

4/6 

27 

norat 

16  (3) 

5 (0/1/0) 

3 (0/1/0) 

5/9 

36 

LI 

tettjiit 

les 

movc.l 

move! 

move.) 

bn 

(Dl,  Lfnil) 

_onify_li>t,AO 

D0,-(PDL) 

D1,-(PDL) 

AO.  -(PDL) 

_continne 

6/4 

2+  2  (0/0/0) 

8 (0/0/1) 

8 (0/0/1) 

3 (0/0/1) 

3 

12/10 

2+  2  (0/0/0) 

6 (0/0/1) 

6 (0/0/1) 

5 (0/0/1) 

6 

16/12 

8+  3  (0/2/0) 

6 (0/1/1) 

6  (0/1/1) 

6  (0/1/1) 

9 

Ls 

leil_ilruel 

les 

move  1 
move  1 
move  1 

(Dl,  Lfnil) 

_BBify_«truct,AO 

D0,-(PDL) 

Dl,-(PDL) 

AO,  -(PDL) 
continue 

3/1 

2+  2  (0/0/0) 

3 (0/0/1) 

3  (0/0/1) 

3 (0/0/1) 

3 

18/16 

2+  2  (0/0/0) 

S  (0/0/1) 

5  (0/0/1) 

S (0/0/1) 

6 

27/23 

3+  3  (0/2/0) 

6 (0/1/1) 

6 (0/1/1) 

6 (0/1/1) 

9 

Lfail: 

move! 

bn 

pdlBise(MP),PDL 

fail 

3 (1/0/0) 

3 

7 (1/0/0) 

6 

9 (1/2/0) 

9 

cycles 

entry 

_unify_li3t 
nnifv  list 

18(4) 

11  (2) 

42  (4) 

30(2) 

55  (16) 

41  (12) 

per  elmemeBt  _  — .  . . 

▼»r,vir 

vir.nvir 

avir.vir 

cooft, const 

listjist 

itnic.struc 

46+  d+  t 

38+  d+  t  (3) 

39+ d+  t  (3) 

28+  d+  c  (2) 
53+  d+  ul  (6) 
50+  d+  Ul  (6) 

(4) 

09+  d+  t  (3) 

94+  d+  t  (3) 

58+  d+  c  (2) 

114+ d+  ul  (6) 
120+ d+  ua  (6) 

108+d+t  (4) 

131+  d+t  (32) 

127+ d+t  (38) 

70+  d+  c  (16) 

149+  d+  ul  (32) 

160+  d+  us  (37) 

exit  - - - - - - 

const.coDst 

c_v»r,ncdr 

Bcdr,c_v»r 

vnr.vir 

etruc.jtruc 

30+  d+  c  (1) 
31+  d+  c  (3) 
19+ d+  c  (2) 
41+d+  c+t  (3) 
35+  d+  ua  (3) 

64+d+c(l) 

92+  d+  c  (3) 

70+  d+  c  (2) 
126+ d+ c+t  (3) 
84d+  Ul  (3) 

80+  d+  c  (15) 

122+  d+  c  (28) 

90+  d+  c  (22) 

164+  d+c+  t  (37) 
113+  d+  ua  (31) 

d: time needed for dereference
t: time needed for trail
c: time needed for decdr
ul and us are the time needed to unify a list or a structure respectively
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unify_struct

input:      D0, D1

output:

temps:      D0, D1, S, A1, A0, PDL

function:   general unification routine for struct

iDstructioD 

_nnify_«truct: 

move.  I 

move! 

nnify_«tnict: 

and  b 

lad.b 

movea 

move! 

move! 

cmp  I 

beq 

bra 


(PDL)+  ,D1 
(PDL)+  ,D0 

#xfc,Dl 

#xfc,D0 

D0,S 

D1,A1 

(S)+  ,D0 

(Alh.DO 

loop 

Lfail 


*  (1/0/0) 

*  (1/0/0) 

0+  0  (0/0/0) 
0+  0  (0/0/0) 
0 (0/0/0) 

0 (0/0/0) 

*  (1/0/0) 

0+  4  (1/0/0) 

1/3 

3 


«  (1/0/0) 

6  (1/0/0) 

2+  2  (0/0/0) 
2+2  (0/0/0) 
2  (0/0/0) 

2  (0/0/0) 

6  (1/0/0) 
2+4  (1/0/0) 
4/6 
6 


7 (1/1/0) 

7 (1/1/0) 

3+  3  (0/2/0) 
3+  3  (0/2/0) 
3  (0/1/0) 

3 (0/1/0) 

7 (1/1/0) 
3+4(J/l/0) 
5/0 
0 


_coDtinue 

move! 

(S)+  ,D0 

4  (1/0/0)  6  (1/0/0) 

7 (1/1/0) 

bmi 

50 

1/3  4/6 

5/9 

move.l 

(Al)+  .Dl 

4  (1/0/0)  6  (1/0/0) 

7 (1/1/0) 

bmi 

Lfail 

1/3  4/6 

5/9 

loop: 

derefertBce 

(DO,  Lref,  Lnref) 

Lref; 

5  (0)  12  (0) 

17  (5) 

dereference 

(Dl,  Lit,  Lm) 

Lrn: 

move.l 

DO,AO 

0 (0/0/0)  2 (0/0/0) 

3 (0/1/0) 

bclr 

#cdr,D0 

1  (0/0/0)  4  (0/0/0) 

5 (0/1/0) 

beq 

1 

1/3  4/6 

5/9 

DO.AO 

0 (0/0/0)  2  (0/0/0) 

3 (0/1/0) 

bset 

#cdr,Dl 

1  (0/0/0)  4  (0/0/0) 

5 (0/1/0) 

move  1 

D1,(A0) 

3  (0/0/1)  4  (0/0/1) 

6 (0/1/1) 

exg 

DO.DI 

0 (0/0/0)  2 (0/0/0) 

3 (0/1/0) 

brx 

3 

3  6 

9 

Lrr 

5  (0)  12  (0) 

17  (5) 

bclr 

#cdr,D0 

1  (0/0/0)  4  (0/0/0) 

5 (0/1/0) 

bmi 

10 

1/3  4/6 

5/9 

bclr 

#cdr.Dl 

1  (0/0/0)  4  (0/0/0) 

5 (0/1/0) 

bmi 

20 

1/3  4/6 

5/9 

cmp.l 

DO.DI 

0  (0/0/0)  2  (0/0/0) 

3 (0/1/0) 

bit 

2 

1/3  4/6 

5/9 

1: 

movea  1 

DO.AO 

0 (0/0/0)  2 (0/0/0) 

3 (0/1/0) 

exg 

DO.DI 

0 (0/0/0)  2 (0/0/0) 

3 (0/1/0) 

2: 

move.l 

DO.(AO) 

3  (0/0/1)  4  (0/0/1) 

5 (0/1/1) 

3: 

trail 

Dl.AO 

t-t- 1  (0)  20+ 1  (0) 

28+ t  (7) 

rts 

9  (1/0/0)  10(1/0/0) 

12  (1/2/0) 

11: 

baet 

movea.l 

baet 

#cdr,DO 

D1,A0 

#cdr,Dl 

2 

1  (0/0/0) 

0 (0/0/0) 

1  (0/0/0) 

1/3 

4  (0/0/0) 

2 (0/0/0) 

4  (0/0/0) 

4/6 

3 (0/1/0) 

3 (0/1/0) 

6 (0/1/0) 

3/9 

13: 

cmp.l 

bit 

exg 

bra 

D0,D1 

11 

D0,D1 

11 

0 (0/0/0) 

1/3 

0 (0/0/0) 

3 

4 (0/0/0) 

4/6 

2  (0/0/0) 

6 

3 (0/1/0) 

3/9 

3  (0/1/0) 

8 

Larcf. 

dereference 

Lnr: 

btst 

beq 

baet 

move. I 

bra 

Lnn: 

cmp.l 

beq 

g-witcb.tag 

LI: 

testjist 

lea 

move! 
move  I 
move! 
bra 
La; 

te3t_atruct 

lea 

move! 
move  I 
move! 
bra 


(Dl,  Lnr,  Lnn) 

#cdr,Dl 

2 

#cdr,DO 
DO, (AO) 

3 

DO.Dl 

_continne 

(DO,  Lfail,  LI,  Lfail,  La) 
(Dl,  Lfail) 

_nDify_liat,AO 

DO,-(PDL) 

D1,-(PDL) 

AO,-(PDL) 

_coDtinue 

(Dl,  Lfail) 
_UDify_atrurt,AO 

DO,-(PDL) 

D1,-(PDL) 

AO,-(PDL) 

continue 


0  (0/0/0) 

1/3 

3 _ 

6(0) 

S(0) 

1 (0/0/0) 

3 

1 (0/0/0) 

3  (0/0/1) 

3 

6(0) 

0 (0/0/0) 

1/3 

4 


2 (0/0/0) 

4/6 

6 _ 

12(0) 

12(0) 

*  (0/0/0) 

6 

4  (0/0/0) 

4  (0/0/1) 

6 

12(0) 

2 (0/0/0) 

4/6 

18 


2+ 2  (0/0/0)  2+ 2  (0/0/0) 


3 (0/0/1) 

3 (0/0/1) 

3  (0/0/1) 

3 

1/3 

2+  2  (0/0/0) 

3  (0/0/1) 

3  (0/0/1) 

3  (0/0/1) 

3 


3  (0/0/1) 

5  (0/0/1) 
i  (0/0/1) 

6 

16/18 

2+  2  (0/0/0) 
6 (0/0/1) 

S (0/0/1) 

6 (O/O/I) 

6 


3 (0/1/0) 

S/8 

8 _ 

16(0) 

17  (3) 

3  (0/1/0) 

8 

3 (0/1/0) 

3 (0/1/1) 

8 

16(3) 

3  (0/1/0) 

3/9 

23 

12/16 

3-t-  3  (0/2/0) 

6 (0/1/1) 

6 (0/1/1) 

6(0/l/l) 

8 

23/27 

3+  3  (0/2/0) 

6  (0/1/1) 

6  (0/1/1) 

6  (0/1/1) 

8 




unify_struct (cont.)

Instruction                          best            cache            worst

50:
    move.l  (A1),D1                  3 (1/0/0)       6 (1/0/0)        7 (1/1/0)
    bge     Lfail                    1/3             4/6              5/9
    cmp.l   D0,D1                    0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    bne     Lfail                    1/3             4/6              5/9
    cmp.l   #cdr_nil,D0              0+0 (0/0/0)     2+4 (0/0/0)      3+5 (0/2/0)
    bne     Lfail                    1/3             4/6              5/9
    rts                              9 (1/0/0)       10 (1/0/0)       12 (1/2/0)
Lfail:
    move.l  pdlBase(MP),PDL          3 (1/0/0)       7 (1/0/0)        9 (1/2/0)
    bra     _fail                    3               6                9

cycles                               best            cache            worst

entry
    _unify_struct                    18 (4)          42 (4)           55 (16)
    unify_struct                     11 (2)          30 (2)           41 (12)
per element
    var,var                          46+d+t (4)      108+d+t (4)      150+d+t (42)
    nvar,var                         38+d+t (3)      88+d+t (3)       131+d+t (32)
    var,nvar                         38+d+t (3)      84+d+t (3)       127+d+t (38)
    const,const                      25+d (2)        52+d (2)         68+d (16)
    list,list                        47+d+ul (5)     103+d+ul (5)     134+d+ul (25)
    struc,struc                      d+us (5)        108+d+us (5)     145+d+us (25)
exit                                 17 (2)          38 (2)           49 (13)

d: time needed for dereference
t: time needed for trail
ul and us: time needed to unify a list or a structure respectively
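Cycle counts such as these translate into execution time once a clock rate is fixed. The sketch below converts the exit cost (17, 38 and 49 cycles) and the const,const per-element formula into microseconds for the MC68020 clock rates named in the Introduction. The function names are illustrative, and the value chosen for d in the usage line is a hypothetical parameter, not a figure taken from the report.

```python
# Exit cost of unify_struct in cycles, as listed in the table above.
EXIT_CYCLES = {"best": 17, "cache": 38, "worst": 49}

# MC68020 clock rates considered in the study, in MHz.
CLOCKS_MHZ = [10, 12, 16.7, 25, 30, 40]


def cycles_to_microseconds(cycles, clock_mhz):
    """One cycle lasts 1/clock_mhz microseconds."""
    return cycles / clock_mhz


def const_const_cycles(case, d):
    """Per-element cost for const,const: the table entry plus dereference time d."""
    base = {"best": 25, "cache": 52, "worst": 68}[case]
    return base + d


if __name__ == "__main__":
    for mhz in CLOCKS_MHZ:
        times = {case: round(cycles_to_microseconds(c, mhz), 2)
                 for case, c in EXIT_CYCLES.items()}
        print(f"{mhz} MHz: {times}")
    # d = 12 is an arbitrary illustrative value for the dereference time.
    print(cycles_to_microseconds(const_const_cycles("cache", 12), 16.7))
```

The same conversion applies to any row of the tables once the symbolic terms (d, t, c, u, ul, us) have been given values for a particular benchmark.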
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A.5.  Read  Unification  Routines 


unify_variable_X

input:      Xi, S
output:
temp:
function:   read mode unification when Xi is in register

Instruction                          best            cache            worst

r_unify_variableX:
    move.l  (S)+,Xi                  4 (1/0/0)       6 (1/0/0)        7 (1/1/0)
    bge     Lo                       1/3             4/6              5/9
    decdr   (Xi,S,Lcr,_fail,Lo)
Lcr:                                 5 (0)           28 (0)           35 (6)
    move.l  H,D1                     0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    or.l    #cdrlisttag,D1           0+0 (0/0/0)     2+4 (0/0/0)      3+5 (0/2/0)
    move.l  D1,(S)                   3 (0/0/1)       4 (0/0/1)        5 (0/1/1)
    bset    #cdr,Xi                  1 (0/0/0)       4 (0/0/0)        5 (0/1/0)
    trail   Xi,S                     4 (0)           20 (0)           28 (7)
    move.l  D1,Xi                    0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    move.l  H,(H)+                   4 (0/0/1)       4 (0/0/1)        5 (0/1/1)
    bra     write_unify_             3               6                9
Lo:                                  2 (0)           8 (0)            12 (4)
    read_unify_

cycles
base case                            7 (1)           12 (1)           16 (4)
invoke decdr                         7+c (1)         18+c (1)         24+c (7)
change to w mode                     25+c+t (3)      86+c+t (3)       113+c+t (27)

c: time needed to decdr
t: time needed to trail
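The first two rows of the cycles summary for unify_variable_X can be reproduced from the instruction rows. The sketch below assumes that in a branch entry such as 1/3 the first number is the not-taken cost and the second the taken cost; that reading is an inference from the totals, not something the tables state. The constant and function names are illustrative.

```python
# (best, cache, worst) cycle counts, as read from the unify_variable_X table.
MOVE_S_XI = (4, 6, 7)        # move.l (S)+,Xi
BGE_NOT_TAKEN = (1, 4, 5)    # first number of the "1/3  4/6  5/9" entry
BGE_TAKEN = (3, 6, 9)        # second number of the same entry
LO_ENTRY = (2, 8, 12)        # cost listed at label Lo


def add(*rows):
    """Column-wise sum of (best, cache, worst) triples."""
    return tuple(sum(col) for col in zip(*rows))


# Base case: the branch to Lo is taken directly after the move.
base_case = add(MOVE_S_XI, BGE_TAKEN)

# decdr case: the branch falls through, decdr runs (time c, not included
# here) and control reaches Lo with the cost listed at that label.
decdr_case = add(MOVE_S_XI, BGE_NOT_TAKEN, LO_ENTRY)

if __name__ == "__main__":
    print("base case:", base_case)
    print("invoke decdr (plus c):", decdr_case)
```

Under these assumptions the sums give 7, 12, 16 for the base case and 7, 18, 24 (plus c) when decdr is invoked, which agrees with the summary rows.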
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unify_variable_X

input:      Xi in memory
output:
function:   unify in read mode when Xi is in the memory

Instruction                          best            cache            worst

unify_variable_X:
    move.l  (S)+,D0                  4 (1/0/0)       6 (1/0/0)        7 (1/1/0)
    bge     Lo                       1/3             4/6              5/9
    decdr   (D0,S,Lcr,_fail,Lo)
Lcr:                                 5 (0)           28 (0)           35 (6)
    move.l  H,D1                     0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    or.l    #cdrlisttag,D1           0+0 (0/0/0)     2+4 (0/0/0)      3+5 (0/2/0)
    move.l  D1,(S)                   3 (0/0/1)       4 (0/0/1)        5 (0/1/1)
    bset    #cdr,D0                  1 (0/0/0)       4 (0/0/0)        5 (0/1/0)
    trail   D0,S                     4 (0)           20 (0)           28 (7)
    move.l  D1,Xi(MP)                3 (0/0/1)       5 (0/0/1)        7 (0/1/1)
    move.l  H,(H)+                   4 (0/0/1)       4 (0/0/1)        5 (0/1/1)
    bra     write_unify_             3               6                9
Lo:                                  2 (0)           8 (0)            12 (4)
    move.l  D0,Xi(MP)                3 (0/0/1)       5 (0/0/1)        7 (0/1/1)
    read_unify_

cycles
base case                            10 (2)          17 (2)           23 (6)
decdr                                10+c (2)        23+c (2)         31+c (9)
change to w mode                     28+c+t (4)      89+c+t (4)       117+c+t (29)

c: time needed to decdr
t: time needed to trail


unify_variable_Y

input:      Yi
output:
function:   unify in read mode

Instruction                          best            cache            worst

r_unify_variableY:
    move.l  (S)+,D0                  4 (1/0/0)       6 (1/0/0)        7 (1/1/0)
    bge     Lo                       1/3             4/6              5/9
    decdr   (D0,S,Lcr,_fail,Lo)
Lcr:                                 5 (0)           28 (0)           35 (6)
    move.l  H,D1                     0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    or.l    #cdrlisttag,D1           0+0 (0/0/0)     2+4 (0/0/0)      3+5 (0/2/0)
    move.l  D1,(S)                   3 (0/0/1)       4 (0/0/1)        5 (0/1/1)
    bset    #cdr,D0                  1 (0/0/0)       4 (0/0/0)        5 (0/1/0)
    trail   D0,S                     4 (0)           20 (0)           28 (7)
    move.l  D0,-4*i(E)               3 (0/0/1)       5 (0/0/1)        7 (0/1/1)
    move.l  H,(H)+                   4 (0/0/1)       4 (0/0/1)        5 (0/1/1)
    bra     write_unify_             3               6                9
Lo:                                  2 (0)           8 (0)            12 (4)
    move.l  D0,-4*i(E)               3 (0/0/1)       5 (0/0/1)        7 (0/1/1)
    read_unify_

cycles
base case                            10 (2)          17 (2)           23 (6)
decdr                                10+c (2)        23+c (2)         31+c (9)
change to w mode                     28+c+t (4)      89+c+t (4)       117+c+t (29)

c: time needed to decdr
t: time needed to trail
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input: 

ontpnt: 

fnnction: 


Xi  in  register 
nnifv  in  reid  mode 


Instruction 


best 


cncbe 


seorst 


r_nnify_v*lueX: 


movc.l 

bge 

itcir 

Ler 

move.  I 

or.l 

move.) 

bset 

trait 

rfere/erenee 
Lref: 
cmp.l 
bgt 
move.  I 
trail 
move! 
bra 
1: 

move.l 

bclr 

move.l 

bra 

Lnref 

move.l 

bra 

Lo; 

move.l 

move.l 

brs 

read_uBify_ 


(S)+  .DO 

*  (1/0/0) 

6 (1/0/0) 

Lo 

1/3 

4/6 

(DO,S,Lcr,_fail,Lo) 

5(0) 

28(0) 

H,D1 

0  (0/0/0) 

2 (0/0/0) 

^cdr  listtag.Dl 

0+  0  (0/0/0) 

2+  *  (0/0/0) 

Dl.(S) 

3  (0/0/1) 

4  (0/0/1) 

#cdr,D0 

1 (0/0/0) 

4  (0/0/0) 

D0,S 

i  (0) 

20  (0) 

(Xi, Lref, Lnref) 

5(0) 

12(0) 

#8tack_ba3e,Xi 

0+  3  (0/0/0) 

2+  4  (0/0/0) 

1 

1/3 

4/6 

H,{A0) 

3 (0/0/1) 

4 (0/0/1) 

Xi,A0 

4(0) 

20(0) 

H,(H)+ 

4  (0/0/1) 

4  (0/0/1) 

write_unify_ 

3 

6 

Xi.DO 

0 (0/0/0) 

2 (0/0/0) 

#cdr,D0 

1  (0/0/0) 

4 (0/0/0) 

D0,(H)+ 

4  (0/0/1) 

4 (0/0/1) 

wnte_nnify_ 

3 

6 

6(0) 

12  (0) 

Xi,(H)+ 

i  (0/0/1) 

4  (0/0/1) 

wnte_unify_ 

3 

6 

2(0) 

8(0) 

PDL,pdlBase(MP) 

3 (0/0/1) 

5  (0/0/1) 

Xi.Dl 

0 (0/0/0) 

2 (0/0/0) 

unify 

5  (0/0/1) 

7 (0/0/1) 

7  (1/1/0) 

5/8 

35  (6) 

3 (0/1/0) 

3-t  5  (0/2/0) 

5 (0/1/1) 

5  (0/1/0) 
28(7) 

17  (5) 

3-r  5  (0/2/0) 
5/8 

5  (0/1/1) 

28  (7) 

5 (0/1/1) 

8 

3  (0/1/0) 

5 (0/1/0) 

5 (0/1/1) 

8 

16(3) 

5 (0/1/1) 

8 

12(<) 

7 (0/1/1) 

3  (0/1/0) 
13(0/2/1) 


cycles 


base  case 
decdr 

change  to  w  mode 


15- 

15- 

25- 


• «  (3) 

■  c+  u  (3) 
-  c+  t  (3) 


32+  c+  u  (3) 
g4-e  c+t  (3) 


38-^  u  (10) 

47+  c+  u  (13) 
110+  c+  t  (26) 


c: time needed to decdr
t: time needed to trail
u: time needed to unify
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unify_value_Y

input:      Yi

output:

function:   unify Yi in read mode

- 

Instruction 

best 

cache 

worst 

r  unify  vilueY.  ,  , 

move  I 

.4*i(E),Dl 

3 (1/0/0) 

7  (1/0/0) 

e (1/2/0) 

move! 

(S)-l-  ,D0 

4  (1/0/0) 

e (1/0/0) 

7  (1/1/0) 

bge 

Lo 

1/3 

4/6 

5/8 

deedr 

(DO.S.Lcr.Jail.Lo) 

S(0) 

28(0) 

35  (6) 

m oven  1 

D1,A1 

0  (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

xnove.l 

H.Dl 

0 (0/0/0) 

2  (0/0/0) 

3  (0/1/0) 

or  1 

#cdrJisttag.Dl 

O-t-0  (0/0/0) 

2+  4  (0/0/0) 

3+  5  (0/2/0) 

move! 

D1,(S) 

3  (0/0/1) 

4  (0/0/1) 

6  (0/1/1) 

bset 

#cdr,D0 

1  (0/0/0) 

4 (0/0/0) 

5 (0/1/0) 

trail 

D0,S 

4  (0) 

20(0) 

28  (7) 

move.l 

Al.Dl 

0  (0/0/0) 

2 (0/0/0) 

3 (0/1/0) 

derefertnet 

Lref. 

(Dl, Lref, Lnref) 

S(0) 

12  (0) 

17  (5) 

cmp.l 

#stack_ba3e,Dl 

O-t-  3  (0/0/0) 

2+4  (0/0/0) 

3+  5  (0/2/0) 

bgt 

S 

1/3 

4/6 

5/0 

bclr 

#cdr,Dl 

1  (0/0/0) 

4 (0/0/0) 

5  (0/1/0) 

m  oven  1 

Dl,A0 

0 (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

1/3 

4/6 

5/0 

move.) 

H,(A0) 

3 (0/0/1) 

4  (0/0/1) 

5  (0/0/1) 

1: 

trail 

Dl.AO 

<  (0) 

20(0) 

28  (7) 

move.l 

H,(H)* 

4  (0/0/1) 

4  (0/0/1) 

5  (0/1/1) 

bra 

wnte_unify_ 

3 

6 

0 

2: 

move  1 

H,D0 

0 (0/0/0) 

2 (0/0/0) 

3 (0/1/0) 

bset 

#cdr.D0 

1  (0/0/0) 

4 (0/0/0) 

Mo/1/0) 

move  1 

DO.(AO) 

3  (0/0/1) 

4 (0/0/1) 

5 (0/1/1) 

bset 

#cdr.Dl 

1  (0/0/0) 

4  (0/0/0) 

5  (0/1/0) 

bra 

1 

3 

6 

0 

b 

Lnref: 

6(0) 

12  (0) 

16(0) 

bclr 

#cdr,Dl 

1  (0/0/0) 

4 (0/0/0) 

5 (0/1/0) 

move  1 

Dl,(H)-^ 

4(0/0/l) 

4(0/0/l! 

6  (0/1/1) 

bra 

wnte_unify_ 

3 

6 

0 

Lo: 

2(0) 

8(0) 

12  (4) 

move  1 

PDL,pdlBase(MP) 

3(0,0/!) 

5 (0/0/1) 

7  (0/1/1) 

brs 

_unify 

6 (0/0/1) 

7 (0/0/1) 

13  (0/2/1) 

read  unify  _ 

cycles                               best            cache            worst

base                                 18+u (4)        31+u (4)         45+u (12)
decdr                                18+c+u (4)      37+c+u (4)       53+c+u (15)
change to w mode                     35+c+t (4)      111+c+t (4)      146+c+t (35)
no move                              40+c+d+t (4)    123+c+d+t (4)    164+c+d+t (41)
move                                 53+c+d+t (5)    168+c+d+t (5)    227+c+d+t (57)

c: time needed to decdr
d: time needed to dereference
t: time needed to trail
u: time needed to unify

unify  _cca8t^P^ 


npat 

mtpnt. 

[naction _ 

lastruetion _ 

r_anify_coastxBtC; 

movC'l 

bje 

iecdr 

Ler 
move  I 
orl 

move  1 

bset 

(roil 

move! 

bi» 

Uo; 

move. I 

exg 

move.! 

brs 

rvad  aBify_ 
cycles 


CoBStBBt  C 

uaify  witbC^ 


(S)+  .DO 

Lo 

(DO.S.Lcr.Jwl.bo) 

H.Dl 

#cdrJisttig,Dl 

D1,(S) 

#cdr,D0 

D0,S 

#C.(HH 

»nle_nBify_ 

#C,D1 

DO.Dl 

PDL,pdlBue(MP) 

oBifyl 


4 (1/0/0) 

1/S 


s(0) 

0 (0/0/0) 

0+  0  (0/0/0) 
s (0/0/1) 

1 (0/0/0) 

4(0) 

4  (0/0/1) 

3 

2(0) 

0+0  (0/0/0) 
0 (0/0/0) 

3 (0/0/1) 

5 (0/0/1) 


6  (1/0/0) 

4/6 

28(0) 

2  (0/0/0) 
2+4  (0/0/0) 

4  (0/0/1) 

4  (0/0/0) 

20(0) 

8  (0/0/1) 

6 

8(0) 

)  6  (0/0/0) 

2  (0/0/0) 

S  (0/0/1) 

7  (0/0/1) 


7  (1/1/0) 

s/« 

35  (6) 

3 (0/1/0) 

S+ 5  (0/2/0) 

5 (0/1/1) 

5  (0/1/0) 

28  (7) 

7  (0/1/1) 

S 

12  (4) 

5 (0/1/0) 

3 (0/1/0) 

7  (0/1/1) 
13  (0/2/1) 


bise  case 
decdr 

fhxBge  to  ^  mode_ 


,,+  ,+  K3)  8g+c+2j!I_Jli- - 


c: time needed to decdr
t: time needed to trail
u: time needed to unify


iBput: 

output 

functioB 

instructioB 

I^UBify_“il 

I  move  ! 
bge 
iecdr 
Lcr 
move  1 


nnify  NIL  io  read  tnod^ 


(S),D0 

Jail 

(DO, S.Lcr, Lear, Jail) 

#cdr_oil,(S) 

#cdr,D0 


3  (1/0/0) 
1/3 

5(0) 

3  (0/0/1) 
1  (0/0/0) 


6  (1/0/0) 

4/6 

28(0) 

8 (0/0/1) 
4  (0/0/0) 
20(0) 


7  (1/1/0) 
5/9 

35  (6) 

7  (0/1/1) 
5  (0/1/0) 
28  (7) 


(rail 

bra 

Lesr: 
emp  1 

D0,S 

_COBllBUe 

#cdr_Bil,D0 

costiaue 

-1 

3 

6(0) 

0 (0/0/0) 

1/3 

6 

14(0) 

2+  4  (0/0/0) 
4/6 

6 

B 

17  (3) 

S+S  (0/2/0)  1 

s/w 

9 

beq 

bra 

cycles 

fail 

edr  sref 
edr  ref 

3 

13+  c  (l) 

20+  c+  t  (2) 

36+ c  (1) 
76+c+t  (2) 

46+  c  (11) 

96+  c+ 1  (21) 

c: time needed to decdr
t: time needed to trail


unify_cdr_X

input:      Xi
output:
function:   Xi is unified with the cdr of a list

Instruction                          best            cache            worst

r_unify_cdrX:
    move.l  (S),D0                   3 (1/0/0)       6 (1/0/0)        7 (1/1/0)
    bge     1                        1/3             4/6              5/9
    decdr   (D0,S,Lcr,Lcrn,Lo)
Lcr:                                 5 (0)           28 (0)           35 (6)
    bset    #cdr,D0                  1 (0/0/0)       4 (0/0/0)        5 (0/1/0)
Lcrn:                                6 (0) *         14 (0) *         17 (3) *
    move.l  D0,Xi                    0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    bra     _continue                3               6                9
Lo:                                  2 (0)           8 (0)            12 (4)
    subq    #4,S                     0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
1:
    move.l  S,Xi                     0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    or.l    #cdrlisttag,Xi           0+0 (0/0/0)     2+4 (0/0/0)      3+5 (0/2/0)
    bra     _continue                3               6                9

cycles
base case                            9 (1)           26 (1)           36 (9)
decdr to cnref                       13+c (1)        32+c (1)         41+c (9)
decdr to cref                        13+c (1)        50+c (1)         64+c (13)
decdr to others                      9+c (1)         34+c (1)         47+c (13)
Xi in memory                         3 (1) more      3 (1) more       4 (1) more

c: time needed to decdr

unify_cdr_Y

input:      Yi
output:
function:   unify Yi with the cdr of a list

Instruction                          best            cache            worst

r_unify_cdrY:
    move.l  (S),D0                   3 (1/0/0)       6 (1/0/0)        7 (1/1/0)
    bge     1                        1/3             4/6              5/9
    decdr   (D0,S,Lcr,Lcrn,Lo)
Lcr:                                 5 (0)           28 (0)           35 (6)
    bset    #cdr,D0                  1 (0/0/0)       4 (0/0/0)        5 (0/1/0)
Lcrn:                                6 (0)           14 (0)           17 (3)
    move.l  D0,-4*i(E)               3 (0/0/1)       5 (0/0/1)        7 (0/1/1)
    bra     _continue                3               6                9
Lo:                                  2 (0)           8 (0)            12 (4)
    subq    #4,S                     0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
1:
    move.l  S,D0                     0 (0/0/0)       2 (0/0/0)        3 (0/1/0)
    or.l    #cdrlisttag,D0           0+0 (0/0/0)     2+4 (0/0/0)      3+5 (0/2/0)
    move.l  D0,-4*i(E)               3 (0/0/1)       5 (0/0/1)        7 (0/1/1)
    bra     _continue                3               6                9

cycles
base case                            12 (2)          31 (2)           43 (11)
decdr to cnref                       16+c (2)        37+c (2)         48+c (11)
decdr to cref                        16+c (2)        55+c (2)         71+c (15)
decdr to others                      12 (2)          39 (2)           54 (15)

c: time needed to decdr
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snify  void  ^ 

inpst; 

ostpot: 

unify  untamed  variable  in  read  mode 

Isstmctioo 

best  cache 

worst 

r_Bnify_void; 

move.  I 

b*e 

itcdr 

Lcr 

move.) 

or.l 

move.  I 

beet 

trail 

move. I 

bra 

Lo: 

read_uBify_ 


(S)+  ,D0 

4  (1/0/0) 

6  (1/0/0) 

Lo 

1/3 

4/6 

(D0,S.Lcr,_fail,Lo) 

5(0) 

28  (0) 

H,D1 

0 (0/0/0) 

2  (0/0/0) 

#cdrJisttag.Dl 

0+  0  (0/0/0) 

2+  4  (0/0/0) 

D1,(S) 

3 (0/0/1) 

4  (0/0/1) 

#cdr,D0 

1  (0/0/0) 

4  (0/0/1) 

DO.S 

4(0) 

20  (0) 

H,(H)+ 

4  (0/0/1) 

4  (0/0/1) 

write_unify_ 

3 

6 

2(0) 

8  (0) 

base  case 

7(1) 

12(1) 

decdr 

7+c(l) 

18+c(l) 

change to w mode

25+  c+t  (3) 

g4+  c+t  (3) 

7(l/t/0) 

S/9 

3S  (6) 

3 (0/1/0) 

3+  5  (0/2/0) 
S(0/l/l) 
S(0/1/I) 
28(7) 

S (0/1/1) 

9 

12  (4) 


cycles 


16  (4) 

24+ c  (7) 

101+ c+t  (27) 


c  time needed to decdr
t  time needed to trail
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A.6.  Write  Unification  Routines 


unify_variable_X

input:    Xi
output:
function: unify Xi in write mode, Xi in register

Instruction                  best         cache        worst

w_unify_variableX:
    move.l  H,Xi             0 (0/0/0)    2 (0/0/0)    3 (0/1/0)
    move.l  Xi,(H)+          4 (0/0/1)    4 (0/0/1)    5 (0/1/1)
write_unify_

cycles
Xi in register               4 (1)        6 (1)        8 (3)

unify_variable_X

input:    Xi
output:
function: unify Xi in write mode when Xi is in memory

Instruction                  best         cache        worst

w_unify_variableX:
    move.l  H,D0             0 (0/0/0)    2 (0/0/0)    3 (0/1/0)
    move.l  D0,(H)+          4 (0/0/1)    4 (0/0/1)    5 (0/1/1)
    move.l  D0,Xi(MP)        3 (0/0/1)    5 (0/0/1)    7 (0/1/1)
write_unify_

cycles                       7 (2)        11 (2)       15 (5)
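The cycles row of each routine appears to be the column-wise sum of its instruction entries. The following is a minimal Python sketch of that bookkeeping for the in-memory w_unify_variableX routine, reading its three instructions as 0/4/3 (best), 2/4/5 (cache), and 3/5/7 (worst) cycles. It assumes that the parenthesized triple on each instruction is a count of operand reads, instruction prefetches, and operand writes, and that the single parenthesized number in the cycles row is the sum of those three counts; the names `Timing` and `total` are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Timing(NamedTuple):
    """One table entry: cycle count plus an assumed (read/prefetch/write) triple."""
    cycles: int
    reads: int
    prefetches: int
    writes: int


def total(rows: List[Timing]) -> Tuple[int, int]:
    """Sum cycles and bus accesses over the instructions of one routine."""
    cycles = sum(r.cycles for r in rows)
    accesses = sum(r.reads + r.prefetches + r.writes for r in rows)
    return cycles, accesses


# Entries for move.l H,D0 / move.l D0,(H)+ / move.l D0,Xi(MP)
W_UNIFY_VARIABLE_X_MEM = {
    "best":  [Timing(0, 0, 0, 0), Timing(4, 0, 0, 1), Timing(3, 0, 0, 1)],
    "cache": [Timing(2, 0, 0, 0), Timing(4, 0, 0, 1), Timing(5, 0, 0, 1)],
    "worst": [Timing(3, 0, 1, 0), Timing(5, 0, 1, 1), Timing(7, 0, 1, 1)],
}

if __name__ == "__main__":
    for case, rows in W_UNIFY_VARIABLE_X_MEM.items():
        cycles, accesses = total(rows)
        print(f"{case}: {cycles} ({accesses})")
```

Under these assumptions the sums come out as 7 (2), 11 (2), and 15 (5). Routines with macro rows (dereference, trail, decdr) or branch entries of the form a/b would need the path taken to be chosen before summing.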

unify_variable_Y

input:    Yi
output:
function: unify Yi in write mode

Instruction                  best         cache        worst

w_unify_variableY:
    move.l  H,D0             0 (0/0/0)    2 (0/0/0)    3 (0/1/0)
    move.l  D0,(H)+          4 (0/0/1)    4 (0/0/1)    5 (0/1/1)
    move.l  D0,-4*i(E)       3 (0/0/1)    5 (0/0/1)    7 (0/1/1)
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anify  v»lue_X  | 

input:    Xi in register
output:
function: unify Xi. Need to dereference to handle unsafe value

Instruction                  best         cache        worst

cycles 

w  unify  v».lue_X: 

7  (2) 

11  (2) 

15  (5) 

dtrtjtrtnct 

Lref: 

cmp.l 

bge 

move.l 

trail 

move.l 

move.l 

brn 

(Xi,  Lref,  Lnref) 

#stack_base,Xi 

1 

H.(AO) 

Xi,A0 

H,Xi 

H,(H)-k 

write_unify_ 

5(0) 

O-fO  (1/0/0) 
1/3 

3  (0/0/1) 

4(0) 

0  (0/0/0) 

4  (0/0/1) 

3 

12  (0) 

2+4  (1/0/0) 

4/6 

4  (0/0/1) 

20  (0) 

2  (0/0/0) 

4 (0/0/1) 

6 

17(5) 

3+ 5  (1/2/1) 

5/0 

5 (0/1/1) 

28(7) 

3 (0/1/0) 

5 (0/1/1) 

0 

1: 

move  1 

bclr 

move.l 

bra 

Lnref: 

move.l 

write_unify_ 

Xi.DO 

#cdr,D0 

D0,(H)-i- 

unify_write_ 

Xi,(H)-r 

0  (0/0/0) 

1  (0/0/0) 

4  (0/0/1) 

3 

6  (0) 

4  (0/0/1) 

2  (0/0/0) 

4  (0/0/0) 

4 (0/0/1) 

6 

12  (0) 

4 (0/0/1) 

3 (0/1/0) 

5 (0/1/0) 

5 (0/1/1) 

0 

16(3) 

5 (0/1/1) 

cycles             best           cache          worst
Xi nref            10+d (1)       16+d (1)       21+d (5)
   ref nmove       16+d (1)       40+d (1)       56+d (15)
   ref move        20+d+t (2)     58+d+t (2)     80+d+t (22)

d  time to dereference
t  time to trail

unify_value_X

input:    Xi in memory
output:
function: unify Xi in write mode

Instruction                        best          cache         worst

w_unify_valueX:
    move.l  Xi(MP),D1              3 (1/0/0)     7 (1/0/0)     9 (1/2/0)
    dereference (D1, Lref, Lnref)
Lref:                              5 (0)         12 (0)        17 (5)
    cmp.l   #stack_base,D1         0+0 (1/0/0)   2+4 (1/0/0)   3+5 (1/2/0)
    bgt     1                      1/3           4/6           5/9
    move.l  H,(A0)                 3 (0/0/1)     4 (0/0/1)     5 (0/1/1)
    trail   D1,A0                  4 (0)         20 (0)        28 (7)
    move.l  H,Xi(MP)               3 (0/0/1)     5 (0/0/1)     7 (0/1/1)
    move.l  H,(H)+                 4 (0/0/1)     4 (0/0/1)     5 (0/1/1)
    bra     write_unify_           3             6             9
1:
    move.l  D1,Xi(MP)              3 (0/0/1)     5 (0/0/1)     7 (0/1/1)
    bclr    #cdr,D1                1 (0/0/0)     4 (0/0/0)     5 (0/1/0)
    bra     2                      3             6             9
Lnref:                             6 (0)         12 (0)        16 (3)
    move.l  D1,Xi(MP)              3 (0/0/1)     5 (0/0/1)     7 (0/1/1)
2:
    move.l  D1,(H)+                4 (0/0/1)     4 (0/0/1)     5 (0/1/1)
write_unify_

cycles
nref                               16+d (3)      28+d (3)      37+d (9)
ref nmove                          22+d (3)      62+d (3)      70+d (19)
ref move                           26+d+t (4)    70+d+t (8)    94+d+t (26)

d  time to dereference
t  time to trail
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unify_v»luf_Y 


input:
output:
function: unify Yi in write mode

Instruction                        best          cache         worst

w_unify_valueY:
    move.l  -4*i(E),D0             3 (1/0/0)     7 (1/0/0)     9 (1/2/0)
    dereference (D0, Lref, Lnref)
Lref:                              5 (0)         12 (0)        17 (5)
    cmp.l   #stack_base,D0         0+0 (0/0/0)   2+4 (0/0/0)   3+5 (0/2/0)
    bge     1                      1/3           4/6           5/9
    move.l  H,(A0)                 3 (0/0/1)     4 (0/0/1)     5 (0/1/1)
    trail   D0,A0                  4 (0)         20 (0)        28 (7)
    move.l  H,-4*i(E)              3 (0/0/1)     5 (0/0/1)     7 (0/1/1)
    move.l  H,(H)+                 4 (0/0/1)     4 (0/0/1)     5 (0/1/1)
    bra     write_unify_           3             6             9
1:
    bclr    #cdr,D0                1 (0/0/0)     4 (0/0/0)     5 (0/1/0)
    move.l  D0,(H)+                4 (0/0/1)     4 (0/0/1)     5 (0/1/1)
    bra     unify_write_           3             6             9
Lnref:                             6 (0)         12 (0)        16 (3)
    move.l  D0,(H)+                4 (0/0/1)     4 (0/0/1)     5 (0/1/1)
write_unify_

cycles
ref move                           26+d (4)      68+d (4)      93+d (24)
ref no move                        19+d (2)      55+d (2)      62+d (13)
nref                               13+d (1)      23+d (1)      30+d (5)

d  time for dereference

input:    Constant C
output:
function: unify C in write mode

Instruction                  best         cache        worst

w_unify_constantC:
    move.l  #C,(H)+          4 (0/0/1)    8 (0/0/1)    7 (0/1/1)
write_unify_

cycles                       4 (1)        8 (1)        7 (2)

input:    Yi
output:
function: unify Yi with the cdr of a list in write mode

Instruction                  best         cache        worst

w_unify_cdrY:
    move.l  H,D1             0 (0/0/0)    2 (0/0/0)    3 (0/1/0)
    bset    #cdr,D1          1 (0/0/0)    4 (0/0/0)    5 (0/1/0)
    move.l  D1,(H)+          4 (0/0/1)    4 (0/0/1)    5 (0/1/1)
    move.l  D1,-4*i(E)       3 (0/0/1)    5 (0/0/1)    7 (0/1/1)

cycles                       8 (2)        15 (2)       20 (6)

w_unify_nil:
    move.l  #cdr_nil,(H)+    4 (0/0/1)    8 (0/0/1)    7 (0/1/1)

cycles                       4 (1)        8 (1)        7 (2)

unify_void

input:
output:
function: create unnamed variable on the heap

Instruction                  best         cache        worst

w_unify_void:
    move.l  H,(H)+           4 (0/0/1)    4 (0/0/1)    5 (0/1/1)
unify_write_
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A.7.  Escape  Routines 


escape comparison

input:
output:
temps:    D0, D1
function:

Instruction

escape_cmp/2:

itrt/trtnct 

Laref: 

switcb.tag’ 

Lc: 

btst 

bae 

bra 

Lis: 

move! 

brs 

move! 

_coBtiBue. 

itrtjtrtntt 

Larefl: 

Bsritch_tags 

Lcl: 

cmp.l 

bacc 

Its 

List 

move! 

exg 

brs 

emp  I 

bacc 

Its 

Lfail: 

move! 

bra 


cycle 


d  time needed
e  time needed


(DO.  Lfail,  Laref) 


#2.DO 

Lfail 

_coBtinue 

D1,-(PDL) 

eval 

(PDL)-e  ,D1 


(Dl, Lfail, Larefl) 

4 

(Dl,Llsl,Lfail,Lel,Llsl)  4 


D0,-(PDL) 

DO.Dl 

eval 

(PDL)+  ,D1 
Lfail 


pdlBase(MP),PDU 
Jail _ _ 


const, const
struct, const
const, struct
struct, struct


to dereference
to evaluate the expressions


best 

cache 

worst 

4(0) 

12(0) 

16  (3) 

4(0) 

18  (0) 

25 

1  (0/0/0) 

4  (0/0/0) 

5 (0/1/0) 

1/3 

4/6 

5/6 

3 

6 

6 

3 (0/0/1) 

s  (0/0/1) 

6 (0/1/1) 

6 (0/0/1) 

7  (0/0/1) 

13  (0/2/1) 

4  (1/0/0) 

«  (1/0/0) 

7  (1/1/0) 

4 

10 

12 

4 

18 

25 

0  (0/0/0) 

2  (0/0/0) 

3 (0/1/0) 

1/3 

4/6 

S/e 

6  (1/0/0) 

10  (1/0/0) 

12  (1/2/0) 

3 (0/0/1) 

5  (0/0/1) 

6 (0/1/1) 

0 (0/0/0) 

2  (0,/0/0) 

3 (0/1/0) 

5 (0/0/1) 

7  (0/0/1) 

13 (0/2/1) 

0+4  (1/0/0)  2+4  (1/0/0) 

1  3+ 4  (1/1/0) 

1/3 

4/6 

5/6 

8  (1/0/0) 

10(1/0/0) 

12  (1/2/0) 

3 (1/0/0) 

7  (1/0/0) 

6  (1/1/0) 

3 

6 

6 

31+d         86+d         113+d
38+d+e       80+d+e       120+d+e
43+d+e       104+d+e      136+d+e
50+d+e       108+d+e      146+d+e

DO,  D1 


e»c*pO»/2  (DO,  Dl) 


iopnt: 

ootpnt: 

function: 


Instruction 

esc»pO>/2: 

itrtjertnce 

Lnref: 

twitchjtog 

Lc: 

btst 

bnc 

bn 

Lis: 

move  I 

brs 

move! 

_continue: 

itrtjtTtnct 

Lrefl: 

bset 

btst 

bne 

bclr 

1; 

move  I 

trail 

rts 

Lnref  1 
emp  I 
bne 
rts 

cycles 


evaluate D0. If D1 is reference type, assign it the result,
else compare it with the result


best 

cache 

worst 

(DO,  Lfail,  Lnref) 

4 

10 

12 

(DO,  Lis,  Lfail,  Lc,  Lis) 

A 

18 

2S 

#2,D0 

1  (0/0/0) 

4 (0/0/0) 

5 (0/1/0) 

Lfail 

1/3 

4/6 

S/8 

_continue 

3 

6 

0 

D1,-(PDL) 

3 (0/0/1) 

3 (0/0/1) 

6 (0/1/1) 

eval 

S  (0/0/1) 

7 (0/0/1) 

13  (0/2/1) 

(PDL)+  ,D1 

4  (1/0/0) 

6  (1/0/0) 

7  (1/1/0) 

(Dl,  Lrefl,  Lnrefl) 

4 

14 

17 

#cdr,DO 

1 (0/0/0) 

4 (0/0/0) 

S (0/1/0) 

#cdr,Dl 

1  (0/0/0) 

4  (0/0/0) 

S (0/1/0) 

1 

1/3 

4/6 

5/8 

#cdr,D0 

1 (0/0/0) 

4  (0/0/0) 

5 (0/1/0) 

DO,(AO) 

3  (0/0/1) 

4 (0/0/1) 

5 (0/1/1) 

D1,A0 

4 

20 

28 

g (1/0/0) 

10  (1/0/0) 

12(1/2/0) 

4 

10 

12 

D0,D1 

0 (0/0/0) 

2 (0/0/0) 

3 (0/1/0) 

Lfail 

1/3 

4/6 

6/9 

W  (1/0/0) 

10  (1/0/0) 

12 (1/2/0) 

const, const       27+d         68+d         88+d
const, struct      34+d+e       72+d+e       95+d+e
var, const         51+d         104+d        137+d
var, struct        58+d         108+d+e      144+d+e

iostnictioD 

best 

cache 

worst 

MULl: 

move! 

XI, Dl 

0  (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

idqIb  w 

XS.Dl 

2S  (0/0/0) 

27  (0/0/0) 

28(0/1/0) 

bra 

_eDd 

3  (0/0/0) 

6 (0/0/0) 

8  (0/2/0) 

D!V1 

move! 

XI. Dl 

0  (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

div9.1 

X3,D1 

54  (0/0/0) 

56  (0/0/0) 

56(0/1/0) 

and.l 

#quotmnsk,Dl 

0+  0  (0/0/0) 

2+  4  (0/0/0) 

3+  5  (0/2/0) 

bra 

_end 

3 (0/0/0) 

6 (0/0/0) 

6  (0/2/0) 

MODI: 

movf  1 

XI, Dl 

0 (0/0/0) 

2  (0/0/0) 

3 (0/1/0) 

divs  1 

X3,D1 

54  (0/0/0) 

56  (0/0/0) 

56  (0/1/0) 

move  1 

#16, DO 

0 (0/0/0) 

6 (0/0/0) 

5  (0/1/0) 

Isr.l 

D0,D1 

3 (0/0/0) 

6 (0/0/0) 

6  (0/1/0) 

end:  ... 

Isl  1 

#3,D1 

1  (0/0/0) 

4  (0/0/0) 

4  (0/1/0) 

#2,D1 

0+  0  (0/0/0) 

2+  2  (0/0/0) 

3+3  (0/2/0) 

move  1 

XO.DO 

0 (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

itrtjtrtnet 

(DO,Lref,Lnref) 

deref/b 

deref/c 

deref/w 

Lref: 

4 

14 

17 

baet 

#cdr,Dl 

1  (0/0/0) 

4 (0/0/0) 

5 (0/1/0) 

btst 

#cdr,D0 

1  (0/0/0) 

4  (0/0/0) 

5  (0/1/0) 

boe 

2 

1/3 

4/6 

5/8 

bclr 

#cdr,Dl 

l(0/0/0) 

4  (0/0/0) 

5  (0/1/0) 

2; 

move! 

D1,(A0) 

3 (0/0/1) 

4  (0/0/1) 

5 (0/1/1) 

irail 

DO, AO 

4+  t 

20+ t 

28+ t 

rts 

8 (1/0/0) 

10  (1/0/0) 

12  (1/2/0) 

Lsref; 
cmp  1 

D0,D1 

0 (0/0/0) 

2 (0/0/0) 

3 (0/1/0) 

bne 

Uil 

1/3  (0/0/0) 

4/6  (0/0/0) 

5/8 

rts 

8  (1/0/0) 

10  (1/0/0) 

12  (1/2/0) 

cycles             best           cache          worst
des var            47+d+t         148+d+t        186+d+t
des const          36+d+t         110+d+t        140+d+t
add or sub         3              10             15
mul                28             35             40
div                57             70             76
mod                57             70             70

d  time to dereference
t  time to trail



eval D0

input:    D0
output:   D0
temps:    PDL, A0, A1, D0, D1
function: evaluate D0 and put the result in D0.
          This is a recursive subroutine.

Instruction


best 


cache 


worst 


eval; 

itrtfertnct 

Lnref: 

iwitch_tag 

LcO; 

btst 

bne 

rts 

LIO: 

and  b 

movea  1 

move! 

move! 

dtcir 

Lcnr 

cmp.l 

beq 

Lf: 

move! 
bra 
LsO: 
and  b 
movea  I 
move! 
and  w 
bne 
move  I 
move.  I 
move! 
brs 

move  I 
move! 
move.  I 
brs 
Isr.l 

movea  I 
move.! 


(DO,  Lf,  Lnref) 

(DO.Lf.LIO.LcO.LsO) 

#3, DO 
Lf 


Oxfc.DO 
D0,A0 
(A0)+  ,D0 
(A0),D1 

(Dl,Lf,Lcnr,Lf) 

#cdr_nil,Dl 

eval 

pdlBase(MP),PDL 

Jail 

Oxfc.DO 
D0,A0 
(A0)+  .DO 
#arity2,D0 
Lf 

(A0)+  .-(PDL) 
(AO)+  .-(PDL) 
(AO)-i-  .DO 

eval 

(PDL)-*-  ,D1 
DO.-(PDL) 

Dl.DO 
eval 
#3, DO 
D0,A1 
(PDL)-t-  .DO 


i 

10 

12 

4 

18 

25 

1  (0/0/0) 

4  (0/0/0) 

6 (0/1/0) 

1/3 

4/6 

5/8 

0 (1/0/0) 

10(1/0/0) 

12(1/2/0) 

OH-  0  (0/0/0) 

2-I-  2  (0/0/0) 

3-1  3  (0/2/0) 

0 (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

t  (1/0/0) 

6  (1/0/0) 

7  (1/1/0) 

3 (1/0/0) 

6  (1/0/0) 

7  (1/1/0) 

« 

12 

16 

O-f  0  (0/0/0) 

2-e  4  (0/0/0) 

3-1  5  (0/2/0) 

1/3 

4/6 

5/8 

3  (1/0/0) 

7 (1/0/0) 

8  (1/1/0) 

3 (0/0/0) 

6 (0/0/0) 

8 (0/2/0) 

0+  0  (0/0/0) 

2-f  2  (0/0/0) 

31  3  (0/2/0) 

0 (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

4 (1/0/0) 

6  (1/0/0) 

7  (1/1/0) 

0-+  0  (0/0/0) 

2-t-  2  (0/0/0) 

31  3  (0/2/0) 

1/3 

4/6 

5/8 

6  (1/0/1) 

7  (1/0/1) 

8  (1/1/1) 

6  (1/0/1) 

7  (1/0/1) 

8  (1/1/1) 

4  (1/0/0) 

6  (1/0/0) 

7  (1/1/0) 

5 (0/0/1) 

7  (0/0/1) 

13(0/2/1) 

4  (1/0/0) 

6  (1/0/0) 

7  (1/1/0) 

3 (0/0/1) 

6 (0/0/1) 

6  (0/1/1) 

0 (0/0/0) 

2  (0/0/0) 

3  (0/1/0) 

5  (0/0/1) 

7  (0/0/1) 

13(0/2/1) 

1  (0/0/0) 

4  (0/0/0) 

4  (0/1/0) 

0 (0/0/0) 

2  (0/0/0) 

3  (0/1/0) 

3  (1/0/0) 

6  (1/0/0) 

7  (1/1/0) 

eval (cont.)

Instruction

best

cache

worst

iir.l 

#3,D1 

1  (0/0/0) 

4 (0/0/0) 

4  (0/1/0) 

move  1 

(PDL)+.D1 

S (1/0/0) 

6  (1/0/0) 

7 (1/1/0) 

move.) 

#16, AO 

0 (0/0/0) 

6  (0/0/0) 

5 (0/1/0) 

Isr.l 

AO.Dl 

3 (0/0/0) 

6  (0/0/0) 

6  (0/1/0) 

jmp 

T»ble 

(Table, PD, Dl.b) 

1+  3  (0/0/0) 

4+  6  (0/0/0) 

7+6  (0/2/0) 

»dd: 

ADD 

lub; 

SUB 

mul: 

MUL 

div: 

DIV 

mod 

MOD 

ADD: 

add.I 

A1,D0 

0 (0/0/0) 

2 (0/0/0) 

S (0/1/0) 

br» 

_end 

3  (0/0/0) 

6 (0/0/0) 

9 (0/2/0) 

SUBl: 
snb  1 

A1,D0 

0  (0/0/0) 

2 (0/0/0) 

3  (0/1/0) 

brn 

_end 

3 (0/0/0) 

6 (0/0/0) 

9 (0/2/0) 

MULl; 

muls  ir 

A1,D0 

25  (0/0/0) 

27  (0/0/0) 

28  (0/1/0) 

bn 

_end 

3  (0/0/0) 

6 (0/0/0) 

9 (0/2/0) 

DIVl; 

divs.l 

A1,D0 

54  (0/0/0) 

56  (0/0/0) 

56  (0/1/0) 

asd.l 

#qnotmask,DO 

0+0  (0/0/0) 

2+  4  (0/0/0) 

3+5  (0/2/0) 

bra 

_end 

3 (0/0/0) 

6 (0/0/0) 

9  (0/2/0) 

MODI: 

divs.l 

Al.DO 

54  (0/0/0) 

56  (0/0/0) 

56 (0/1/0) 

move! 

#16, D1 

0 (0/0/0) 

6 (0/0/0) 

S(0/I/0) 

Isr.l 

Dl.DO 

3 (0/0/0) 

6 (0/0/0) 

6  (0/1/0) 

end 

4  (0/1/0) 

Isl  1 

#3, DO 

1  (0/0/0) 

4 (0/0/0) 

or  b 

#2, DO 

0+  0  (0/0/0) 

2+  2  (0/0/0) 

3+3  (0/2/0) 

rts 

9 (1/0/0) 

10  (1/0/0) 

12  (1/2./0) 

cycles             best           cache          worst
const              19+d           46+d           59+d
list               27+c+d+e       70+c+d+e       93+c+d+e
struc              71+d+e         157+d+e        202+d+e
add, sub           3              8              12
mul                28             33             37
div                57             68             73
mod                57             68             67

c  time  to  decdr  for  lists 
d  time  to  dereference 
t  time  to  trail 
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add,  mb,  mol _ 

input:    X1, X2 - input arguments
output:   X3 - result
function: X1 and X2 can only be integers, and X3 may be a
          variable or an integer

InatmctioD 

beat  cache 

wont 

add 

itrtftrtncc 

Laref: 

te»t_intefer 

itrtftrtnct 

Larefl: 

leftjateger 

move  I 

move.) 

l9ll 

Isll 

asr.l 

asr.l 

add) 

I9II 
orb 
move.  I 
move.  I 
bra 

.coatiaue: 


(XI, Jail, Laref) 

(XI  Jail) 

(X2  Jail, Larefl) 

(X2,Jail) 

XI, DO 

X2.D1 

#1,D0 

#1.D1 

#4, DO 

#4,D1 

D1,D0 

#3.D0 

#2, DO 

X3,D1 

PDL,pdlBa9e(MP) 
aaifyl 


6(0) 

12(0) 

16(3) 

1/3 

16/16 

23/27 

6(0) 

12(0) 

16  (3) 

1/3 

16/18 

23/27 

0 (0/0/0) 

2  (0/0/0) 

3 (0/1/0) 

0  (0/0/0) 

2  (0/0/0) 

3 (0/1/0) 

1  (0/0/0) 

4  (0/0/0) 

4  (0/1/0) 

1  (0/0/0) 

4  (0/0/0) 

4  (0/1/0) 

1  (0/0/0) 

4  (0/0/0) 

4  (0/1/0) 

1 (0/0/0) 

4  (0/0/0) 

4  (0/1/0) 

0 (0/0/0) 

2  (0/0/0) 

3 (0/1/0) 

1  (0/0/0) 

4  (0/0/0) 

4  (0/1/0) 

0 (0/0/0) 

4  (0/0/0) 

6  (0/2/0) 

0  (0/0/0) 

2  (0/0/0) 

3  (0/1/0) 

1  3 (0/0/1) 

5 (0/0/1) 

7  (0/1/1) 

5  (0/0/1) 

7  (0/0/1) 

13(0/2/1) 

50+  d+  n-6  (2) 

100+  d+  n-12  (2) 

136+  d+  n-l( 

30+  d+  n-6  (2) 

100+  <)■+  tt-12  (2) 

1  StH"  d'+‘ 

55+  d+  u-6  (2) 

125+  d+  u-12  (2) 

161+  d+  n-l( 

cycles

add
sub
mul.w

d  time to dereference
u  time to unify
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div,  mod 

input:    X1, X2 - input arguments
output:   X3 - result
function: X1 and X2 can only be integers, and X3 may be a
          variable or an integer

Instruction                        best          cache         worst

div:
    dereference (X1,_fail,Lnref)
Lnref:                             6 (0)         12 (0)        16 (3)
    test_integer (X1,_fail)        1/3           16/18         23/27
    dereference (X2,_fail,Lnref1)
Lnref1:                            6 (0)         12 (0)        16 (3)
    test_integer (X2,_fail)        1/3           16/18         23/27
    move.l  X1,D0                  0 (0/0/0)     2 (0/0/0)     3 (0/1/0)
    move.l  X2,D1                  0 (0/0/0)     2 (0/0/0)     3 (0/1/0)
    lsl.l   #1,D0                  1 (0/0/0)     4 (0/0/0)     4 (0/1/0)
    lsl.l   #1,D1                  1 (0/0/0)     4 (0/0/0)     4 (0/1/0)
    asr.l   #4,D0                  1 (0/0/0)     4 (0/0/0)     4 (0/1/0)
    asr.l   #4,D1                  1 (0/0/0)     4 (0/0/0)     4 (0/1/0)
    div.w   D1,D0                  54 (0/0/0)    56 (0/0/0)    56 (0/1/0)
    swap    D0                     1 (0/0/0)     4 (0/0/0)     4 (0/1/0)
    clr.w   D0                     0 (0/0/0)     2 (0/0/0)     3 (0/1/0)
    swap    D0                     1 (0/0/0)     4 (0/0/0)     4 (0/1/0)
    ext.w   D0                     1 (0/0/0)     4 (0/0/0)     4 (0/1/0)
    lsl.l   #3,D0                  1 (0/0/0)     4 (0/0/0)     4 (0/1/0)
    bclr    #cdr,D0                1 (0/0/0)     4 (0/0/0)     5 (0/1/0)
    or.b    #0x2,D0                0 (0/0/0)     4 (0/0/0)     6 (0/2/0)
    move.l  X3,D1                  0 (0/0/0)     2 (0/0/0)     3 (0/1/0)
    move.l  PDL,pdlBase(MP)        3 (0/0/1)     5 (0/0/1)     7 (0/1/1)
    bsr     unify1                 5 (0/0/1)     7 (0/0/1)     13 (0/2/1)
_continue

cycles
div                                89+d+u-6 (2)  176+d+u-12 (2)  213+d+u-16 (41)
mod                                88+d+u-6 (2)  172+d+u-12 (2)  209+d+u-16 (40)

d  time to dereference
u  time to unify
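The arithmetic routines strip the tag from each operand with a left shift by 1 followed by an arithmetic right shift by 4, and rebuild the tagged result with a left shift by 3, a clear of the cdr bit, and an OR with 2. The following is a minimal Python sketch of that round trip. It assumes 32-bit words, a cdr bit in the most significant position, and a 3-bit tag field in the low bits with integer tag 2; this layout is inferred from the shift counts in the listings and is not stated in the text. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
CDR_BIT = 1 << 31      # assumed position of the cdr bit
INT_TAG = 0x2          # assumed integer tag, from "or.b #0x2"


def to_signed32(word: int) -> int:
    """Interpret the low 32 bits of word as a two's-complement value."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word


def untag_int(word: int) -> int:
    """lsl.l #1 drops the cdr bit; asr.l #4 drops the tag and sign-extends."""
    shifted = to_signed32(word << 1)
    return shifted >> 4


def tag_int(value: int) -> int:
    """lsl.l #3 makes room for the tag; bclr #cdr; or.b #0x2 sets the tag."""
    word = (value << 3) & MASK32
    word &= ~CDR_BIT & MASK32
    return word | INT_TAG


if __name__ == "__main__":
    for v in (5, -7, 0):
        print(v, hex(tag_int(v)), untag_int(tag_int(v)))
```

Under these assumptions a value survives the round trip as long as it fits in the 28 bits left between the cdr bit and the tag, and a set cdr bit on an operand does not change the untagged value.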
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Inpot; 

compiler comparisons

X1, X2 - both can only be integers or dereferences to integers

output:

function:

Test to see if the condition holds; proceed or
fail accordingly

worst 

UstrBction 

comp 

(Xl,_Uil,Lnref) 

best 

rftre/trence 

6  (0) 

12(0) 

16(3) 

Laref: 

ttitjnteg«r 

1/3 

16/18 

23/27 

dtrtjertnct 

{X2,_f»il,l-Brefl) 

6  (0) 

12  (0) 

16  (3) 

Usrefl: 
testjBttjer 
move! 
move  t 

Isll 

Itl.l 
cmp  1 

(X2,J»il) 

XI, DO 

X2,D1 

#1,D0 

#1.D1 

D0,D1 

fill 

1/3 

0  (0/0/0) 

0 (0/0/0) 

1  (0/0/0) 

1  (0/0/0) 

0 (0/0/0) 

1/3 

16/18 

2 (0/0/0) 

2  (0/0/0) 
t  (0/0/0) 

4  (0/0/0) 

2 (0/0/0) 
4/6 

23/27 

3 (0/1/0) 

3 (0/1/0) 

4 (0/1/0) 

4 (0/1/0) 

3  (0/1/0) 

3/0 

cycles             best           cache          worst
true               17 (0)         76 (0)         95 (24)
false              19 (0)         78 (0)         99 (25)



Appendix B.  Benchmark Results.

Table 1.  Cycle Count Ratio (MC68020 vs. VLSI-PLM)


prog name

plm

68020 (best)

68020 (cache)

68020  (worst) 

chit  (Mulder/Tick) 

1.0  t 

2'13 

7.71 

neb  (Mulder/Tick) 

1.0 

2M 

6  42 

boyer  (Muld«r/Tick) 

1.0 

1  86 

4  43 

5.71 

chit  (Pitt/Chen)  • 

1.0 

S13 

667 

8  33 

boyer  (Pitt/Chen)  • 

10 

2.30 

5.22 

6.82 

browse 

10 

3.00 

6.23 

8.55 

coni 

10 

2.71 

6  94 

9  16 

CO&6 

1.0 

360 

7  91 

10  35 

differen 

10 

2  68 

6  11 

8  05 

hinoi 

10 

27B 

5  70 

7  50 

mnmith 

10 

3  13 

675 

8  74 

nrevl 

10 

271 

7  01 

9  38 

pillD 

1.0 

3.09 

7  23 

9  40 

pri2 

lot 

2.29 

3  03 

3  49 

q?4 

10 

2  32 

5  09 

6  73 

3  30 

6  99 

9  10 

I  20 

2  01 

!  3  95 

†  Timing as reported in [1] is actually for their plm model, which had a
   fixed size choice point frame and a single cycle memory.

•  Cycle count ratios for chat and boyer are not comparable to those
   obtained in [1] since their numbers are for their plm model while
   ours are for the most up-to-date version of the VLSI-PLM.

‡  The cycle counts for the PLM are calculated with the assumption that
   an internally coupled MC68020 is used to evaluate the escapes. Therefore,
   cycle count is different for each of the three cases, depending on how the
   MC68020 evaluates the escapes, i.e., whether best, cache, or worst case
   applies.

1 
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T.bl.  t.  ReUtlv.  P.rf.,m..c  of  Pr.J«t»i  MCMOJO.  ...  10  MH.  VLSI-PLM  » 


query (best)
query (cache)
query (worst)


†  The PLM timing for the escapes is calculated under the assumption
   of an internally coupled MC68020 running at the corresponding clock rate.

‡  Relative performance is computed as the ratio of VLSI-PLM time divided by
   the projected execution time of the processor indicated in that column.


I 
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Table 


Relative Performance of Projected MC68020s vs. 16.7 MHz VLSI-PLM ‡


Program        PLM         MC68020
               16.7 MHz    16.7 MHz    25 MHz    30 MHz    40 MHz

coni  (best) 
coni  (cache) 
coni  (worst) 

10 

0  37 

0  11 

0  11 

0  56 

0  22 

0  16 

0  67 

026 

0  20 

0.89 

0  35 

0.27 

con6  (best) 
con6  (cache) 
con6  (worst) 

1.0 

0.29 

0  13 

0  10 

013 

0  20 

0  15 

052 

0  21 

0  18 

069 

0.32 

0.2  •! 

differen  (best) 
dilferen  (cache) 
differen  (worst) 

10 

037 

0  16 

0  12 

0  55 

0  21 

0  18 

066 

0  28 

0.22 

0.88 

0  37 

0  29 

hanoi  (best) 
hanoi  (cache) 
hanoi  (worst) 

10 

0  35 

0  17 

0  13 

0  53 

0  26 

0  20 

0  61 

0  31 

0,21 

085 

Oil 

0  32 

mum&tb  (b«st) 

0  32 

0  18 

0  57 

076 

0  36 

0  28 

muzDith  (c&chc) 
mumith  (wont) 

1  0 

0.15 

0  11 

0.22 

0  17 

0  27 

0  21 

nrevl  (best) 
nrevl  (cache) 
nrevl  (worst) 

10 

0  37 

0  11 

0.11 

0  55 

0  21 

0  16 

0  66 

0  26 

0.19 

0  88 

0.35 

0.25 

palin  (best) 
palm  (cache) 
palm  (worst) 

10 

0  31 

0  15 

0.11 

0  51 

0  22 

0  17 

0  61 

0  26 

0  20 

081 

035 

0,27 

pti2  (best) 
pri2  (cache) 
pn2  (worst) 

10  t 

oil 

0  33 

0.29 

0  56 

0  39 

0  31 

0.63 

0  13 

0  37 

0  78 

0.51 

013 

qst  (best) 
qs4  (cache) 

10 

013 

020 

0  61 

0  29 

0  77 

0  35 
n  77 

1  03 

0  17 

0  36 

qs4  (worst) 

0  15 

queens  (best) 

0  31 

0  16 

0  65 

0  '’R 

073 

0  35 

queens  (cache) 
queens  (worst) 

1  0 

0  15 

0  11 

0  17 

020 

0.27 

query  (best) 
query  (cache) 

10  t 

050 

0  30 

021 

0  82 

0  39 

0  31 

0  95 

015 

0  35 

1.22 

0  57 

0  15 

~ 

†  The PLM timing is calculated by assuming an internally coupled MC68020
   running at the corresponding clock rate.

‡  Relative performance is computed as the ratio of simulated VLSI-PLM
   execution time divided by the projected execution time of the processor
   indicated in that column.
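The relative performance figure defined in the footnote can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: execution time is taken as cycle count divided by clock rate, the cycle ratio is the MC68020 cycle count divided by the VLSI-PLM cycle count (as in Table 1), and the function and parameter names are hypothetical.

```python
def relative_performance(cycle_ratio: float, mc_mhz: float, plm_mhz: float) -> float:
    """Return PLM execution time divided by projected MC68020 execution time.

    cycle_ratio is (MC68020 cycles) / (VLSI-PLM cycles) for one benchmark and
    one timing case (best, cache, or worst). Times are in units of PLM cycles
    per MHz, so only their ratio is meaningful.
    """
    plm_time = 1.0 / plm_mhz          # one PLM cycle-unit at the PLM clock
    mc_time = cycle_ratio / mc_mhz    # cycle_ratio cycle-units at the MC68020 clock
    return plm_time / mc_time


if __name__ == "__main__":
    # A cycle ratio of 2.71, the con1 (best) entry of Table 1, at equal clocks
    print(round(relative_performance(2.71, 16.7, 16.7), 2))
    # The same ratio with the MC68020 projected at 40 MHz
    print(round(relative_performance(2.71, 40.0, 16.7), 2))
```

A value below 1.0 means the projected MC68020 is slower than the VLSI-PLM. For programs that use escapes, the PLM time itself depends on the coupled MC68020, so this simple ratio would not be expected to reproduce those rows.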


Table 4.  Calculated vs. Real Cycles for MC68020


Program    Calculated                  Real Time on Tower         Real Time on SUN 3/200
           Best     Cache    Worst     Macro Expand   Quintus     Macro Expand

coni 

SOI 

1280 

1700 

1670 

1000 

1400 

conO 

3030 

eooo 

8040 

8000 

8400 

8020 

banoi 

130000 

278000 

301000 

300000 

367000 

336000 

nrevl 

67100 

148000 

108000 

106000 

141000 

181000 

qs4 

113000 

260000 

838000 

367000 

344000 

306000 

que«ns 

100000 

223000 

280000 

327000 

232000 

208000 

Remarks:


Benchmark programs run on the Tower/32 suffer at least one wait
state for each data access. On the other hand, the programs run
on the SUN 3/200 can approach the ideal MC68020 speed if every access
results in a cache hit.

There are optimizations in the Quintus system, as demonstrated by the
results for nrev1 and queens.
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Table  6.  Code  Sise  Comparleon 


